1.  Introduction

In the standard formulation of the problem in the modern theory of optimal
filtering, the signal process and measurement process are described by the
mathematical/statistical model:

$$x(t+1) = f(x(t),t) + G(x(t),t)\,\xi(t), \quad x(1) = x_1, \qquad (1.1)$$

$$y(t) = h(x(t),t) + \varepsilon(t), \qquad (1.2)$$

where $x(t)$ is an $n$-dimensional stochastic process; $y(t)$ is an $m$-dimensional
stochastic process; $x_1$ is a Gaussian random vector; $\xi(t)$ and $\varepsilon(t)$ are respectively
$n_1$-dimensional and $m_1$-dimensional Gaussian noise processes with zero means; $x_1$,
$\xi(t)$ and $\varepsilon(t)$ have given joint probability distributions; and $f(x,t)$, $G(x,t)$ and
$h(x,t)$ are known functions with such appropriate dimensions and properties that
(1.1) and (1.2) describe faithfully the evolutions of the signal and measurement.
The problem of discrete-time optimal filtering is to design and make a
discrete-time dynamic system that inputs $y(t)$ and outputs an estimate $\hat{x}(t)$ of $x(t)$ at each
time $t = 1, 2, \ldots, T$, which estimate minimizes a given estimation error criterion.
Here $T$ is a positive integer or infinity. The dynamic system is called an optimal
filter with respect to the given estimation error criterion. The dynamic state of
the optimal filter at a time $t_1$ must carry the optimal conditional statistics given
all the measurements $y(t)$ that have been received up to and including the time
$t_1$, so that at the next time $t_1 + 1$, the optimal filter will receive and
process $y(t_1 + 1)$ using the optimal conditional statistics from $t_1$, and then produce
the optimal estimate $\hat{x}(t_1 + 1)$. The most widely used estimation error criterion is
the mean square error criterion, $E[\|x(t) - \hat{x}(t)\|^2]$, where $E$ and $\|\cdot\|$ denote the
expectation and the Euclidean norm respectively. The estimate $\hat{x}(t)$ that
minimizes this criterion is called the minimum variance estimate or the least-squares
estimate.

The most commonly used method of treating such a problem is the use of a
Kalman filter (KF) or an extended Kalman filter (EKF). A detailed description
of the KF and EKF (and some other approximate nonlinear filters) can be found
in, e.g., [11] and [1]. The KF and EKF have been applied to a wide range of
areas including aircraft/ship inertial and aided-inertial navigation, spacecraft orbit
determination, satellite attitude estimation, phased-array radar tracking, nuclear
power plant failure detection, power station control, oceanographic surveying,
biomedical engineering, and process control. Many important papers on the
application of the KF and EKF can be found in [24].

In the rare cases where $f$ and $h$ are linear functions of $x(t)$ and $G$ does not
depend on $x(t)$, the model, (1.1) and (1.2), is called the linear-Gaussian model.
If the KF is used for a linear-Gaussian model, the resulting estimate $\hat{x}(t)$ is the
minimum variance (or the least-squares) estimate. In most cases, however, the
foregoing linearity conditions on $f$, $h$ and $G$ are not satisfied and the EKF is
used. At each time point, the EKF, which is a suboptimal approximate filter,
first linearizes $f$ and $G$ at the estimated value of $x(t)$ and linearizes $h$ at the
predicted value of $x(t+1)$. Then the EKF uses the KF equations to update
the estimated value of $x(t+1)$ and the predicted value of $x(t+2)$ for the new
measurement $y(t+1)$. By iterating the linearization and estimation a certain
number of times or until convergence at each time point, we have the so-called
iterated EKF (IEKF). Since both the EKF and IEKF involve linearization, they
are not optimal filters. In fact, when either the random driving term $G(x(t),t)\,\xi(t)$
in (1.1) or the random measurement noise $\varepsilon(t)$ in (1.2) has such large variances
and covariances that the aforementioned estimated value and predicted value of
the signal are not very close to the true signal, and/or when the functions $f$, $G$
and $h$ are not very smooth, the linearization may be a poor approximation and
the EKF as well as the IEKF may yield poor estimates or even fail totally.
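
The predict/linearize/update cycle described above can be sketched for a scalar signal. This is only an illustration under simplifying assumptions that are not in the text: the state is one-dimensional, $f$ and $h$ do not depend on time, and $G \equiv 1$ so the driving noise enters additively with variance `q_var`. The function name `ekf_step` and all argument names are hypothetical.

```python
def ekf_step(x_est, p_est, y_next, f, df, h, dh, q_var, r_var):
    """One scalar EKF cycle (illustrative sketch, assumes G == 1).

    x_est, p_est : current estimate of x(t) and its error variance
    y_next       : new measurement y(t+1)
    f, df        : signal function and its derivative
    h, dh        : measurement function and its derivative
    q_var, r_var : variances of the driving and measurement noises
    """
    # Prediction: linearize f at the current estimate of x(t).
    x_pred = f(x_est)
    slope_f = df(x_est)
    p_pred = slope_f * p_est * slope_f + q_var

    # Update: linearize h at the predicted value of x(t+1),
    # then apply the Kalman filter equations to the linearized model.
    slope_h = dh(x_pred)
    gain = p_pred * slope_h / (slope_h * p_pred * slope_h + r_var)
    x_new = x_pred + gain * (y_next - h(x_pred))
    p_new = (1.0 - gain * slope_h) * p_pred
    return x_new, p_new
```

When `f` and `h` are linear the derivatives are exact and the step coincides with one Kalman filter cycle; for nonlinear `f` or `h` the slopes are only local approximations, which is the source of the degradation discussed above.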

This shortcoming of the EKF and IEKF has motivated an enormous amount
of work on nonlinear filtering in the past thirty years or so. But the results have
been disappointing. With very few, if any, exceptions, the nonlinear filtering
results have been confined to research papers and textbooks. The EKF and, to a
much lesser extent, the IEKF remain the standard filters for estimating stochastic
signals.

The 30-year failure is believed to be related to the methodology that has been
used since R. E. Kalman derived the KF equations. The methodology is
analysis. Starting with a mathematical/statistical model, the methodology searches
for a solution consisting of analytic formulas and/or equations that describe the
structures and determine the parameters of the filter. In the process of searching,
deductive reasoning is used and many assumptions are made to make some special
cases analytically tractable. In fact, the KF was derived under the assumptions
that $f$ and $h$ are linear in $x(t)$, $G$ does not depend on $x(t)$, and $\xi(t)$ and $\varepsilon(t)$
are Gaussian sequences. The model, (1.1) and (1.2), contains such assumptions
as the Markov property, Gaussian distribution, and additive measurement noise.
When enough additional assumptions are made to derive explicit filter equations,
these assumptions are usually so restrictive and/or unrealistic that they prevent
the filter equations from much real-world application.

When  not  enough  additional  assumptions  are  made,  the  analysis  involved  is  so 
deep  and  complicated  that  it  leads  mostly  to  mathematical  formulas  and  equations 
that  are  not  ready  for  designing  or  implementing  a  real  filter.  This  state  of  the  art 
is  reflected  in  [14]  and  [15].  In  the  few  cases  where  the  assumptions  are  not  so  bad 
and  the  explicit  filtering  algorithms  are  available,  these  filtering  algorithms  involve 
such  an  enormous  amount  of  computation  that  their  real-time  implementation  is 
prohibitively  expensive  if  not  impossible.  Some  examples  of  such  cases  can  be 
found  in  [2],  [16],  and  [17]. 

Because of the inherent inaccuracies and frequent failures of the EKF and
IEKF and the restrictive and unrealistic assumptions and prohibitive
computational requirements of other existing filters, new filters are needed that consistently
yield a high degree of estimation accuracy vis-à-vis the information contained in
the measurements about the signal and that can be applied in a large variety of
real-world situations.

Recent years have seen a rapid growth in the development of artificial neural
networks (ANNs), which are also known as connectionist models, parallel
distributed processors, neuroprocessors, and neurocomputers. Being crude
mathematical models of theorized mind and brain activity, ANNs exploit the massively
parallel processing and distributed information representation properties that are
believed to exist in a brain. A good introduction to ANNs can be found in [7] and
[8].

There is a large number of ANN paradigms such as Hopfield networks,
high-order networks, counterpropagation networks, bidirectional associative memories,
piecewise linear machines, neocognitrons, self-organizing feature maps, adaptive
resonance theory networks, Boltzmann machines, multilayer perceptrons (MLPs),
MLPs with various feedback structures, other recurrent neural network paradigms,
etc. These and other ANN paradigms have been applied to systems control (e.g.,
[28]), signal processing (e.g., [13]), speech processing (e.g., [18]), and others (e.g.,
[23]).

There are many research articles concerning applications of ANNs, most of
which can be found in the foregoing books, journals (e.g., IEEE Transactions on
Neural Networks, Neural Networks, and Neural Computation), and conference
proceedings (e.g., Proceedings of the International Joint Conference on Neural
Networks). Application of one of the aforementioned neural network paradigms
to optimal filtering was reported in [25]. The signal and measurement processes
considered therein are described by the linear-Gaussian model and the neural
network used is a Hopfield network with the neural activation function slightly
modified. The connection weights and neuron biases for the network are determined
by using the Kalman filter (KF) equations so that when the Hopfield network
stabilizes at each time point, the stable state is the minimum variance estimate.
The usefulness of the method is very limited, because it can only be applied to the
linear-Gaussian model for which the KF equations are available and the weights
and biases of the Hopfield network need to be updated in the operation of the
Hopfield network by other means using the Kalman filter equations.

During the reporting period, a new neural network approach has been
developed. As opposed to the analytic methodology used in the conventional
filtering theory, the new method is synthetic in nature. The method synthesizes
realizations of the signal process and measurement process into a filter, which is
made out of an artificial recurrent neural network (RNN). We shall call this filter
a neural filter.

The synthesis is performed through training RNNs using a set of training
data consisting of the realizations of the signal and measurement processes. More
specifically, the weights and/or parameters of an RNN are determined by
minimizing a training criterion by the variation of the weights and/or parameters.
The training criterion incorporates all the training data and is a function of the
weights and/or parameters of the RNN under training.

After adequate training, the neural filter is a recursive filter optimal for the
given RNN architecture with the lagged feedbacks carrying the optimal
conditional statistics. If appropriate RNN paradigms and estimation error criteria are
selected, the neural filter of such a paradigm is proven to approximate an optimal
filter in performance (with respect to the selected estimation error criteria) to any
desired degree of accuracy, provided that the RNN that constitutes the neural
filter is of sufficiently large size.

If a mathematical model of the signal and measurement processes such as (1.1)
and (1.2) is available, the realizations of the signal and measurement processes are
generated by computer simulation. Otherwise, these training data can be collected
by conducting actual experiments with the signal and measurement processes.
Since we do not use a mathematical model to derive formulas and equations, such
properties as the Markov property, Gaussian distribution and additive noise are
not required of the signal and measurement processes for the present invention
to apply. In fact, the new approach applies to virtually any signal process (to be
defined in the sequel) and measurement process.
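
Generating training realizations by computer simulation from a model of the form (1.1) and (1.2) can be sketched as follows. The sketch assumes a scalar signal and measurement, a zero-mean Gaussian initial state, and independent Gaussian noises; the function names, the default standard deviations, and the example model in the usage note are all hypothetical choices made for illustration.

```python
import random


def simulate_realization(T, f, g, h, rng, x1_std=1.0, xi_std=1.0, eps_std=1.0):
    """Draw one realization (x(1..T), y(1..T)) of a scalar model.

    f, g, h take (x, t) and return floats; rng is a random.Random instance.
    """
    x = rng.gauss(0.0, x1_std)          # x(1) = x_1, assumed zero-mean Gaussian
    xs, ys = [], []
    for t in range(1, T + 1):
        xs.append(x)
        ys.append(h(x, t) + rng.gauss(0.0, eps_std))       # measurement equation
        x = f(x, t) + g(x, t) * rng.gauss(0.0, xi_std)     # signal equation
    return xs, ys


def make_training_set(num_realizations, T, f, g, h, seed=0):
    """Collect a finite sample of signal/measurement sequences."""
    rng = random.Random(seed)
    return [simulate_realization(T, f, g, h, rng) for _ in range(num_realizations)]
```

For example, `make_training_set(200, 50, lambda x, t: 0.9 * x, lambda x, t: 1.0, lambda x, t: x ** 3)` would produce 200 sequences of length 50 for a linear signal observed through a cubic sensor. When no model is available, the same data structure could be filled from experimental records instead.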

Extensive numerical work was carried out to compare the neural filters and the
KF, EKF and IEKF for signal and measurement processes that can be described
by the mathematical model (1.1) and (1.2). The numerical results show that
the neural filters almost equal the KF in performance for a linear model in both
transient and steady states of filtering and always outperform the EKF and IEKF
for a nonlinear model, when there is a sufficient number of neurons (or processing
elements) in the neural filter.

This report is organized as follows. Chapter 2 is concerned with optimal
filtering by multilayer perceptrons with interconnected hidden neurons. It is proven
that after adequate training, such an RNN can approximate an optimal filter in
performance with respect to the mean square error criterion to any desired degree
of accuracy, provided that the RNN is of sufficient size. Numerical results are given
to illustrate the theory. In Chapter 3, analogous theoretical and numerical results
are presented for optimal filtering by multilayer perceptrons with output
feedbacks. If the signal and/or measurement processes keep increasing in magnitude,
some methods for keeping the sizes of the neural filter and the training data set
manageable are proposed in Chapter 4. These methods of using range extenders
and reducers are discussed and their numerical results provided therein to
illustrate their usefulness. In Chapter 5, adaptive filtering is treated and a surprisingly
simple and powerful method is presented. The initial numerical results that have
been obtained are extremely encouraging, calling for more research on this topic.
A byproduct of the reported work on neural filtering, which is somewhat
incongruent with the preceding chapters, is the higher-order differentiation formulas for
a multilayer perceptron given in Chapter 6. The higher-order derivatives, which
these differentiation formulas calculate recursively, provide interpretations of the
training data and suggest possible pruning of the neural network connections and
neurons. Chapter 7 is a brief conclusion.

2.  Optimal Filtering by Multilayer Perceptron with Interconnected Neurons

This chapter was inspired by the recent resurgence of the study on artificial neural
networks, especially the multilayer perceptrons (MLPs) and MLP-based recurrent
networks. The synaptic weights of such a network are not evaluated by
formulas or equations, but are determined by training the network using the desired
input/output pairs [5, 12, 19, 22, 26]. This training is synthesizing. Of course,
we do not have to use these neural networks for synthesizing filters.
Nevertheless, there are three reasons for using them: First, any measurable function with
a compact support can be approximated to any degree of accuracy by an MLP
with even a single layer of hidden neurons [3, 6, 9]. Second, the feedbacks in a
recurrent network provide the "state" in building a dynamical system. Third, the
massively parallel nature of MLPs and MLP-based recurrent networks makes them
most suitable for real-time filtering.

To  simplify  our  discussion,  we  will,  in  this  chapter,  mainly  consider  an  MLP 
with  a  single  hidden  layer  of  neurons,  whose  states  are  fully[5,  21]  or  partially  fed 
back  to  one  another.  Our  results  can  be  easily  generalized  or  extended  to  many 
other  architectures. 

In training a recurrent MLP into a filter, the measurement sequence $y(t)$ is
used as the inputs consecutively in time and the signal sequence $\psi(x(t))$ as the
corresponding desired outputs. A requirement here that is not needed in the
modern theory of optimal filtering is that $x(t)$ and $y(t)$ stay in a compact set.
The fulfillment of it is easy to justify in the real world. The model, (1.1) and
(1.2), violates this requirement. However, it is only an idealization, whose $x(t)$
and $y(t)$ may stray out of a compact region with arbitrarily small probability
provided that the region is sufficiently large.

The training process minimizes the mean square error between the network
outputs $\hat{\psi}(x(t))$ and the desired outputs $\psi(x(t))$. Consequently, after adequate
training, the estimates $\hat{\psi}(x(t))$ produced by the network are minimum-variance
for the given network architecture. It is easy to see that the mean square error
between $\hat{\psi}(x(t))$ and $\psi(x(t))$ after adequate training decreases, as the number of
hidden neurons in the recurrent MLP increases. We call such a recurrent MLP a
neural filter. The main theorem in this paper is that the output $\hat{\psi}(x(t))$ of a neural
filter converges to the minimum variance estimate, which is also the conditional
expectation, $E[\psi(x(t)) \mid y^t]$, given the measurements $y^t$ up to time $t$, as the number
of fully interconnected hidden neurons approaches infinity.

Four  numerical  examples  are  thoroughly  worked  out  and  reported.  Two  of 
them  show  that  even  the  neural  filters  with  a  few  hidden  neurons  consistently 
outperform  the  extended  Kalman  filter  and  the  iterated  extended  Kalman  filter 
for  nonlinear  systems  of  (1.1)  and  (1.2).  The  third  shows  that  a  neural  filter  with 
a  sufficient  number  of  neurons  is  indistinguishable  from  the  Kalman  filter  for  a 
linear  system  of  (1.1)  and  (1.2).  In  the  last  example,  neither  the  signal  nor  the 
measurement  can  be  described  by  (1.1)  or  (1.2).  Nevertheless,  the  neural  filter 
works  satisfactorily. 

A close look at the proof of the main theorem indicates that full
interconnectedness of the hidden layer is not necessary for the convergence to the minimum
variance filter. Another architecture is also tested in the numerical examples. The
hidden neurons in this architecture are arranged in a ring and each is fed back to
its two neighbors after a unit time delay. As it stands, the main theorem does not
apply to this architecture. However, in all the four examples, the filtering
performance of a recurrent MLP with this architecture is very close to that of a recurrent
MLP with fully interconnected hidden neurons. Notice that the total number of
connections in the architecture with fully interconnected hidden neurons is much
greater than that in the architecture with the ring topology, especially when the
number of neurons is large. This raises the question whether a neural filter with
ring-connected neurons can approximate the minimum variance filter to any
degree of accuracy. The question is under investigation and an answer will soon be
reported.

2.1.  MLPs  with  Interconnected  Hidden  Neurons 

A typical recurrent MLP (RMLP) with a single hidden layer of fully interconnected
neurons is illustrated in Fig. 2.1. It has 2 input variables $y_i(t)$, 4 hidden neurons,
and 2 output variables $\alpha_i(t)$, where $t$ indicates the dependence on time. Denote
the weight from the $i$th input to neuron $j$ by $w^{\mathrm{in}}_{ji}$, the weight from neuron $i$ to the
$j$th output by $w^{\mathrm{out}}_{ji}$, and the weight for the lagged feedback from neuron $i$ to neuron
$j$ by $w^{r}_{ji}$. The activation level $\beta_j(t)$ of, and the weighted sum $\eta_j(t)$ in, neuron $j$
satisfy

$$\beta_j(t) = a(\eta_j(t)), \qquad (2.1)$$

$$\eta_j(t) = \sum_{i=1}^{m} w^{\mathrm{in}}_{ji}\, y_i(t) + \sum_{i=1}^{q} w^{r}_{ji}\, \beta_i(t-1), \qquad (2.2)$$

where $a$ is a monotone increasing neuron activation function, say tanh. The output
$\alpha_i(t)$ is then determined by

$$\alpha_i(t) = \sum_{j=1}^{q} w^{\mathrm{out}}_{ij}\, \beta_j(t), \quad \text{for } i = 1, \ldots, k, \qquad (2.3)$$

where $m$, $q$, and $k$ are the number of inputs, neurons, and outputs, respectively.
It is assumed in this paper that the activation levels are initialized at $\beta_j(0) = 0$,
$j = 1, \ldots, q$.

Figure 2.1: A recurrent MLP with fully interconnected hidden neurons

A typical RMLP with a single hidden layer of ring-connected neurons is
illustrated in Fig. 2.2. It has 2 input variables $y_i(t)$, 5 hidden neurons, and 2 output
variables $\alpha_i(t)$. Using the same symbols as defined above, the RMLP is specified
by the equations (2.1), (2.3), and

$$\eta_j(t) = \sum_{i=1}^{m} w^{\mathrm{in}}_{ji}\, y_i(t) + w^{r}_{j\,p(j-1)}\, \beta_{p(j-1)}(t-1) + w^{r}_{j\,p(j+1)}\, \beta_{p(j+1)}(t-1), \qquad (2.4)$$

where the function $p$ is defined by $p(j) = j$, if $1 \le j \le q$; $= 1$, if $j = q + 1$; $= q$, if
$j = 0$. It is also assumed here that $\beta_j(0) = 0$, $j = 1, \ldots, q$.

Figure 2.2: A recurrent MLP with ring-connected hidden neurons
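
The forward recursion of both architectures can be sketched in a few lines. The sketch assumes tanh as the activation function, uses 0-based indices, and takes the ring neighbors of neuron j to be j-1 and j+1 modulo q, which is meant to mirror the wrap-around function p above; the names `rmlp_forward`, `ring_neighbors`, `w_in`, `w_r` and `w_out` are hypothetical, and the ring variant is only meaningful for three or more hidden neurons.

```python
import math


def ring_neighbors(j, q):
    """0-based indices of the two ring neighbors of neuron j among q neurons."""
    return sorted({(j - 1) % q, (j + 1) % q})


def rmlp_forward(ys, w_in, w_r, w_out, ring=False):
    """Run an RMLP over an input sequence and return the output sequence.

    ys    : list of input vectors y(1), ..., y(T), each of length m
    w_in  : q x m input weights
    w_r   : q x q lagged-feedback weights (only neighbor entries are used if ring)
    w_out : k x q output weights
    """
    q = len(w_in)
    beta = [0.0] * q                          # activation levels start at zero
    outputs = []
    for y in ys:
        new_beta = []
        for j in range(q):
            sources = ring_neighbors(j, q) if ring else range(q)
            eta = sum(w_in[j][i] * y[i] for i in range(len(y)))
            eta += sum(w_r[j][i] * beta[i] for i in sources)   # lagged feedbacks
            new_beta.append(math.tanh(eta))
        beta = new_beta
        outputs.append([sum(row[j] * beta[j] for j in range(q)) for row in w_out])
    return outputs
```

With `ring=True` each neuron reads only two feedback weights instead of q, so the number of feedback connections grows linearly rather than quadratically in the number of hidden neurons.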

As discussed in the preceding section, by either computer simulation or
experiments, we may have available a finite set of signal/measurement sequences,
$(x(t,\omega), y(t,\omega))$, $t = 1, \ldots, T$, $\omega \in S$. It is assumed that the finite set $S$ is a
random sample from the sample space $\Omega$ and adequately reflects the joint probability
distributions of the signal and measurement processes $x(t)$ and $y(t)$.

Suppose that $\psi(x(t)) = [\psi_1(x(t)), \ldots, \psi_k(x(t))]^T$ are what we want to estimate
and are thus the desired outputs for the actual outputs $\alpha_1(t), \ldots, \alpha_k(t)$. Using the
$m$ components of the measurement vector $y(t) = [y_1(t), \ldots, y_m(t)]^T$ as the inputs,
the training data consists of the I/O pairs $(y(t,\omega), \psi(x(t,\omega)))$, $t = 1, \ldots, T$, $\omega \in S$.

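The training is to minimize, by the variation of the RMLP weights $w$, the
mean square error

$$C(w) = \frac{1}{T(\#S)} \sum_{\omega \in S} \sum_{t=1}^{T} \sum_{l=1}^{k} \big(\alpha_l(t,\omega) - \psi_l(x(t,\omega))\big)^2, \qquad (2.5)$$

where $\#S$ is the number of elements in $S$ and $\alpha_l(t,\omega)$ is the $l$th actual RMLP
output at time $t$ corresponding to the input sequence $y(\tau,\omega)$, $\tau = 1, \ldots, t$. We
note here that $\alpha_l(t,\omega)$ is a function of $y(\tau,\omega)$, $\tau = 1, \ldots, t$.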
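
Evaluating the training criterion on stored network outputs is a plain triple sum. The short sketch below assumes the outputs and desired outputs are held in nested lists indexed by realization, time and output component; the function name and the data layout are hypothetical.

```python
def training_criterion(alpha, psi):
    """Mean square error over realizations, time points and output components.

    alpha[s][t][l] : l-th network output at time t+1 for realization s
    psi[s][t][l]   : corresponding desired output
    """
    num_realizations = len(alpha)
    T = len(alpha[0])
    total = 0.0
    for a_seq, p_seq in zip(alpha, psi):
        for a_t, p_t in zip(a_seq, p_seq):
            total += sum((a - p) ** 2 for a, p in zip(a_t, p_t))
    return total / (T * num_realizations)
```

Because each output depends on all earlier measurements of the same realization, the outputs must be produced by running the network sequentially over each realization before this sum is formed.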

A  good  training  method,  which  is  used  in  our  simulation,  is  first  to  evaluate 
the  gradient  of  the  above  error  function  with  respect  to  w  by  the  time-dependent 
recurrent  backpropagation  by  Werbos[27]  and  Pearlmutter[20]  and  then  to  use  it 
in  a  conjugate  gradient  algorithm  for  optimization. 

A network adequately trained by minimizing $C(w)$ will be called a neural
filter.

2.2.  Convergence  to  the  Minimum  Variance  Filter 

The collection of all $k$-dimensional random vectors $x = [x_1, \ldots, x_k]^T$ with finite
second moments is a Hilbert space $L_2$ with the squared norm $E[\|x\|^2] = \sum_{i=1}^{k} E[x_i^2]$,
where $E$ denotes the expectation or the average over the sample space $\Omega$ with the
$\sigma$-field $\mathcal{A}$ of events and the probability measure $P$.

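By simple calculation, we have for any Borel-measurable function $f$ of $y^t :=
(y(1), \ldots, y(t))$,

$$C := \frac{1}{T} \sum_{t=1}^{T} E\big[\|\psi(x(t)) - f(y^t)\|^2\big]
= \frac{1}{T} \sum_{t=1}^{T} \Big\{ E\big[\|\psi(x(t)) - E[\psi(x(t)) \mid y^t]\|^2\big]
+ E\big[\|E[\psi(x(t)) \mid y^t] - f(y^t)\|^2\big] \Big\}, \qquad (2.6)$$

which shows that the conditional expectation $E[\psi(x(t)) \mid y^t]$ is the minimum
variance estimate of $\psi(x(t))$ given $y^t$.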
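
The "simple calculation" behind this decomposition is the vanishing of a cross term. A sketch of the step, writing $m_t$ as shorthand (introduced here only for brevity) for the conditional expectation of $\psi(x(t))$ given the measurements up to time $t$, and assuming $f(y^t)$ has finite second moments:

```latex
\begin{align*}
E\big[\|\psi(x(t)) - f(y^t)\|^2\big]
  &= E\big[\|\psi(x(t)) - m_t\|^2\big]
   + 2\,E\big[(\psi(x(t)) - m_t)^{T}(m_t - f(y^t))\big]
   + E\big[\|m_t - f(y^t)\|^2\big], \\
E\big[(\psi(x(t)) - m_t)^{T}(m_t - f(y^t))\big]
  &= E\Big[\,E\big[(\psi(x(t)) - m_t)^{T} \,\big|\, y^t\big]\,(m_t - f(y^t))\Big]
   = 0,
\end{align*}
```

since $m_t - f(y^t)$ is a function of $y^t$ and the inner conditional expectation of $\psi(x(t)) - m_t$ is zero. Only the last term depends on $f$, and it is zero exactly when $f(y^t) = m_t$ almost surely.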

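Under the assumption that averaging over $S$ is indistinguishable from taking
expectation over $\Omega$, we see, by comparing $C(w)$ in (2.5) and $C$ in (2.6), that

$$C(w) = \frac{1}{T(\#S)} \sum_{\omega \in S} \sum_{t=1}^{T} \Big\{ \big\|\psi(x(t,\omega)) - E[\psi(x(t)) \mid y^t](\omega)\big\|^2
+ \big\|\alpha(t,\omega) - E[\psi(x(t)) \mid y^t](\omega)\big\|^2 \Big\}.$$

Training amounts to minimizing the second term above. The objective is, of
course, to minimize $(1/T) \sum_{t=1}^{T} E[\|\alpha(t) - \psi(x(t))\|^2]$.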

Recall the neural filter architecture described in Sec. 2.1. We observe that adding
a neuron to a hidden layer decreases $\min_w C(w)$. We are now ready to state and
prove the main theorem of this paper.


Theorem. Consider the $n$-dimensional random process $x(t)$ and the $m$-dimensional
random process $y(t)$, $t = 1, \ldots, T$, defined on a probability space $(\Omega, \mathcal{A}, P)$.
Assume that the range $\{y(t,\omega) \mid t = 1, \ldots, T,\ \omega \in \Omega\} \subset R^m$ is compact and $\psi$
is an arbitrary $k$-dimensional Borel function of $x(t)$ with finite second moments
$E[\|\psi(x(t))\|^2]$, $t = 1, \ldots, T$. Let $\alpha(t)$ denote the $k$-dimensional output at time $t$
of an RMLP which has taken the inputs, $y(1), \ldots, y(t)$, in the given order.

(a). Given $\epsilon > 0$, there exists an RMLP with one hidden layer of fully
interconnected neurons such that

$$\frac{1}{T} \sum_{t=1}^{T} E\big[\|\alpha(t) - E[\psi(x(t)) \mid y^t]\|^2\big] < \epsilon.$$

(b). If the RMLP has one hidden layer of $N$ neurons fully interconnected and
the output $\alpha(t)$ is written as $\alpha(t; N)$ here to indicate its dependency on $N$,
then

$$r(N) := \min_{w} \frac{1}{T} \sum_{t=1}^{T} E\big[\|\alpha(t; N) - E[\psi(x(t)) \mid y^t]\|^2\big] \qquad (2.7)$$

is monotone decreasing and converges to 0 as $N$ approaches infinity.

Figure 2.3: A special recurrent MLP used in the proof of the main theorem

Proof. Since (a) is an immediate consequence of (b), we shall only prove (b).

It is obvious that $r(N)$, $N = 1, 2, \ldots$, are all bounded from below by 0. Hence
the monotone decreasing sequence $r(N)$ converges and we denote the limit by
$r(\infty)$. To prove that $r(\infty) = 0$, it suffices to show that for any $\epsilon > 0$, there is an
integer $M$ such that $|r(M)| < \epsilon$. This will be shown in the following for the case
in which the measurement process $y(t)$ is scalar-valued, i.e. $m = 1$. The proof for
$m > 1$ is similar and omitted.

As illustrated in Fig. 2.3, the first $T - 1$ neurons can be so interconnected that
for $t = 0, \ldots, T - 1$, $\beta_1(t + 1) = a(y(t + 1))$ and for $\tau = 2, \ldots, T - 1$,

$$\beta_\tau(t + 1) = a(\beta_{\tau-1}(t)). \qquad (2.8)$$

These neurons are initialized at $\beta_\tau(0) = B$, $\tau = 1, \ldots, T - 1$, where $B$ is a point
in $R^1$ outside the compact set $\{y(t,\omega) \mid t = 1, \ldots, T,\ \omega \in \Omega\}$. At time $t + 1$, the
lagged feedbacks from them are

$$(\beta_{T-1}(t), \ldots, \beta_{t+1}(t), \beta_t(t), \beta_{t-1}(t), \ldots, \beta_1(t))
= (B, \ldots, B, a^{\circ t}(y(1)), a^{\circ (t-1)}(y(2)), \ldots, a(y(t))),$$

where $a^{\circ t}$ denotes the $t$-fold composition $a \circ a \circ \cdots \circ a$ of $a$. Including both these
feedbacks and the input node, we have the $T$-dimensional vector at time $t + 1$,

$$\xi(t + 1) = [B, \ldots, B, a^{\circ t}(y(1)), \ldots, a(y(t)), y(t + 1)]^T$$

to use as the inputs to an MLP with one hidden layer.

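Recall that $E[\psi(x(t + 1)) \mid a^{\circ t}(y(1)), \ldots, a(y(t)), y(t + 1)]$ is a Borel
measurable function of $a^{\circ t}(y(1)), \ldots, a(y(t))$ and $y(t + 1)$. Let us denote it by
$\lambda_{t+1}(a^{\circ t}(y(1)), \ldots, a(y(t)), y(t + 1))$. Now let us consider the mapping $f : R^T \to
R^k$ defined by

$$f(\xi_1, \ldots, \xi_T) =
\begin{cases}
\lambda_1(\xi_T), & \text{if } \xi_1 = \cdots = \xi_{T-1} = B,\ \xi_T \ne B \text{ and } \lambda_1(\xi_T) \text{ defined;} \\
\lambda_2(\xi_{T-1}, \xi_T), & \text{if } \xi_1 = \cdots = \xi_{T-2} = B,\ \xi_{T-1} \ne B,\ \xi_T \ne B \text{ and } \lambda_2(\xi_{T-1}, \xi_T) \text{ defined;} \\
\quad\vdots \\
\lambda_T(\xi_1, \ldots, \xi_T), & \text{if } \xi_1 \ne B, \ldots, \xi_T \ne B \text{ and } \lambda_T(\xi_1, \ldots, \xi_T) \text{ defined;} \\
0, & \text{otherwise.}
\end{cases}$$

Let us consider also the measures $\mu_t$ defined on the Borel sets in $R^T$ by, denoting
$d\xi_1 \times \cdots \times d\xi_T$ by $d\xi$,

$$\mu_1(d\xi) =
\begin{cases}
\frac{1}{T} P(y(1) \in d\xi_T), & \text{if } B \in d\xi_t \text{ for } t < T, \text{ and } B \notin d\xi_t \text{ for } t = T, \\
0, & \text{otherwise;}
\end{cases}$$

$$\mu_2(d\xi) =
\begin{cases}
\frac{1}{T} P(a(y(1)) \in d\xi_{T-1},\ y(2) \in d\xi_T), & \text{if } B \in d\xi_t \text{ for } t < T - 1, \text{ and } B \notin d\xi_t \text{ for } t \ge T - 1, \\
0, & \text{otherwise;}
\end{cases}$$

$$\vdots$$

$$\mu_T(d\xi) = \frac{1}{T} P\big(a^{\circ (T-1)}(y(1)) \in d\xi_1, \ldots, a(y(T - 1)) \in d\xi_{T-1},\ y(T) \in d\xi_T\big).$$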

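Notice that $\mu = \mu_1 + \cdots + \mu_T$ is also a measure on the Borel sets in $R^T$ and that

$$\int \|f(\xi_1, \ldots, \xi_T)\|^2 \mu(d\xi)
= \int \|\lambda_1(\xi_T)\|^2 \mu_1(d\xi) + \cdots + \int \|\lambda_T(\xi_1, \ldots, \xi_T)\|^2 \mu_T(d\xi)$$

$$= \frac{1}{T} \Big\{ E\big[\|E[\psi(x(1)) \mid y(1)]\|^2\big] + \cdots
+ E\big[\|E[\psi(x(T)) \mid a^{\circ (T-1)}(y(1)), \ldots, y(T)]\|^2\big] \Big\}$$

$$\le \frac{1}{T} \Big\{ E\big[\|\psi(x(1))\|^2\big] + \cdots + E\big[\|\psi(x(T))\|^2\big] \Big\}
< \infty.$$

Since $y(t)$, $t = 1, \ldots, T$, are assumed to have a compact range, it is easy to see
that $a^{\circ t}(y(\tau))$, for $t = 0, \ldots, T - 1$ and $\tau = 1, \ldots, T$, all have compact ranges.
Hence there is a compact set $W$ in $R^T$ such that $\mu(W) = 1$.

Consider now the normed space $L_2(R^T, \mu)$ with the norm defined by
$[\int \|g(\xi_1, \ldots, \xi_T)\|^2 \mu(d\xi)]^{1/2}$ for each $k$-dimensional function $g \in L_2(R^T, \mu)$. By
Corollary 2.2 of the paper [9] by Hornik, Stinchcombe and White, the $k$-dimensional
functions represented by all the MLPs with $k$ output nodes and a single layer of
hidden neurons are dense in $L_2(R^T, \mu)$. It then follows that for any $\epsilon > 0$, there
is such an MLP that the function $g(\xi_1, \ldots, \xi_T)$ that the MLP represents satisfies

$$\int \|f(\xi_1, \ldots, \xi_T) - g(\xi_1, \ldots, \xi_T)\|^2 \mu(d\xi) < \epsilon.$$

Let us translate the left hand side of this inequality from $L_2(R^T, \mu)$ back to
the probability space $(\Omega, \mathcal{A}, P)$:

$$\int \|f(\xi_1, \ldots, \xi_T) - g(\xi_1, \ldots, \xi_T)\|^2 \mu(d\xi)$$

$$= \int \|\lambda_1(\xi_T) - g(B, \ldots, B, \xi_T)\|^2 \mu_1(d\xi)
+ \int \|\lambda_2(\xi_{T-1}, \xi_T) - g(B, \ldots, B, \xi_{T-1}, \xi_T)\|^2 \mu_2(d\xi)
+ \cdots
+ \int \|\lambda_T(\xi_1, \ldots, \xi_T) - g(\xi_1, \ldots, \xi_T)\|^2 \mu_T(d\xi)$$

$$= \frac{1}{T} \sum_{t=1}^{T} E\big[\|E[\psi(x(t)) \mid a^{\circ (t-1)}(y(1)), \ldots, y(t)]
- g(B, \ldots, B, a^{\circ (t-1)}(y(1)), a^{\circ (t-2)}(y(2)), \ldots, y(t))\|^2\big]$$

$$= \frac{1}{T} \sum_{t=1}^{T} E\big[\|E[\psi(x(t)) \mid y^t]
- g(B, \ldots, B, a^{\circ (t-1)}(y(1)), a^{\circ (t-2)}(y(2)), \ldots, y(t))\|^2\big],$$

where the last equality holds, because the $\sigma$-fields generated by $y^t$ and
$\{a^{\circ (t-1)}(y(1)), a^{\circ (t-2)}(y(2)), \ldots, y(t)\}$ are identical due to the monotonicity of $a$.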

We notice that the foregoing approximation MLP, $g(\xi_1, \ldots, \xi_T)$, and the $T - 1$
neurons (2.8) discussed earlier form an RMLP with a single hidden layer. The
number $N$ of hidden neurons of this RMLP is the sum of that of the MLP and
$T - 1$. Recalling (2.7), we have

$$r(N) \le \frac{1}{T} \sum_{t=1}^{T} E\big[\|E[\psi(x(t)) \mid y^t]
- g(B, \ldots, B, a^{\circ (t-1)}(y(1)), \ldots, y(t))\|^2\big] < \epsilon,$$

since $r(N)$ minimizes the error function in (2.7) by the variation of the weights
of an RMLP with exactly the same architecture as for the RMLP constructed
earlier. This completes the proof. □
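
The delay-line neurons (2.8) used in the proof can be simulated directly for a scalar measurement sequence. The sketch below assumes tanh as the activation function and requires at least two time points; `delay_line_states`, `compose_tanh` and the argument `marker` (standing for the point B) are hypothetical names. It is a literal simulation of the recursion, so slots that no measurement has reached yet hold images of the marker under tanh.

```python
import math


def compose_tanh(value, n):
    """Apply tanh to value n times (the n-fold composition)."""
    for _ in range(n):
        value = math.tanh(value)
    return value


def delay_line_states(ys, marker):
    """Simulate the T-1 delay neurons for the scalar inputs y(1), ..., y(T).

    Returns a list `states` with states[t][tau - 1] holding beta_tau(t)
    for t = 0, ..., T.
    """
    T = len(ys)
    if T < 2:
        raise ValueError("need at least two time points")
    beta = [marker] * (T - 1)                 # all neurons start at the marker
    states = [list(beta)]
    for y_next in ys:
        # Neuron 1 reads the new input; neuron tau reads neuron tau-1 one step back.
        beta = [math.tanh(y_next)] + [math.tanh(b) for b in beta[:-1]]
        states.append(list(beta))
    return states
```

After t steps, neuron tau (for tau not exceeding t) holds the tau-fold tanh image of y(t - tau + 1), so the hidden layer stores a monotonically transformed copy of the whole measurement history, which is what lets a feedforward MLP on top of it act on all past data.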


In real-world application of a filter, the time interval in which the filter is
applied is usually very long, i.e. $T \gg 1$. Having more than $T - 1$ neurons is
impractical. Nevertheless, the above theorem guarantees that one may get arbitrarily
close to the minimum variance filter, provided that a sufficient number of neurons
are used.

2.3.  Differentiating  an  RMLP  with  Interconnected  Hidden  Neurons 

In this section, we shall briefly review an efficient method of evaluating the gradient dC/dw, which was derived and called the time-dependent recurrent backpropagation (TDRBP) by Werbos[27] and Pearlmutter[20]. To simplify our discussion, let us consider an RMLP, for T = 3, with a single input y(t), a single output $\alpha(t)$, and a hidden layer of q fully interconnected neurons, which are governed by (2.1), (2.2), and (2.3). The training process minimizes, by the variation of w, C(w) in (2.5), which we write as

1  T 

where T(#S)/2 is suppressed for simplification.

Before deriving the formulas for dC(w)/dw, we observe that $\alpha(t,\omega)$ and $\beta_i(t,\omega)$, i = 1, ..., q, are functions of w, $y(t,\omega)$, and $\beta_i(t-1,\omega)$, i = 1, ..., q. By the chain rule of differentiation, we have


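$$\begin{aligned}
\frac{dC}{dw} &= \sum_{\omega\in S}\sum_{t=1}^{3}\frac{\partial C}{\partial\alpha(t,\omega)}\frac{d\alpha(t,\omega)}{dw}\\
&= \sum_{\omega\in S}\Big\{\sum_{t=1}^{3}\frac{\partial C}{\partial\alpha(t,\omega)}\frac{\partial\alpha(t,\omega)}{\partial w}\\
&\qquad+ \sum_{i=1}^{q}\Big[\frac{\partial C}{\partial\alpha(2,\omega)}\frac{\partial\alpha(2,\omega)}{\partial\beta_i(1,\omega)} + \sum_{j=1}^{q}\frac{\partial C}{\partial\alpha(3,\omega)}\frac{\partial\alpha(3,\omega)}{\partial\beta_j(2,\omega)}\frac{\partial\beta_j(2,\omega)}{\partial\beta_i(1,\omega)}\Big]\frac{\partial\beta_i(1,\omega)}{\partial w}\\
&\qquad+ \sum_{i=1}^{q}\frac{\partial C}{\partial\alpha(3,\omega)}\frac{\partial\alpha(3,\omega)}{\partial\beta_i(2,\omega)}\frac{\partial\beta_i(2,\omega)}{\partial w}\Big\}\\
&= \sum_{\omega\in S}\sum_{t=1}^{3}\Big[\frac{\partial C}{\partial\alpha(t,\omega)}\frac{\partial\alpha(t,\omega)}{\partial w} + \sum_{i=1}^{q}\frac{\partial^{+}C}{\partial\beta_i(t,\omega)}\frac{\partial\beta_i(t,\omega)}{\partial w}\Big], \qquad (2.9)
\end{aligned}$$
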
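where $\partial^{+}C/\partial\beta_i(t,\omega)$ is called an ordered partial derivative by Werbos[27] and denotes the partial derivative of C w.r.t. $\beta_i(t,\omega)$ taking into consideration the dependency of $\alpha(t+1,\omega)$ on $\beta_i(t,\omega)$. We notice that the ordered partial derivatives satisfy the recurrent relationship

$$\frac{\partial^{+}C}{\partial\beta_i(t,\omega)} = \frac{\partial C}{\partial\alpha(t+1,\omega)}\frac{\partial\alpha(t+1,\omega)}{\partial\beta_i(t,\omega)} + \sum_{j=1}^{q}\frac{\partial^{+}C}{\partial\beta_j(t+1,\omega)}\frac{\partial\beta_j(t+1,\omega)}{\partial\beta_i(t,\omega)} \qquad (2.10)$$

with the final condition, $\partial^{+}C/\partial\beta_i(T,\omega) = \partial C/\partial\beta_i(T,\omega)$. This relationship is the TDRBP.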

The formulas (2.9) and (2.10) constitute a very efficient technique for evaluating the gradient dC/dw. For the numerical examples in the next section, the RMLPs are trained by first applying TDRBP and then optimizing by the conjugate gradient method.
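
The backward recursion for the ordered derivatives can be illustrated with a small numerical sketch. The network below is an assumption made for illustration only, since (2.1)-(2.3) are not reproduced here: q hidden neurons with $\beta(t) = \tanh(W\beta(t-1) + u\,y(t))$, a linear output $\alpha(t) = c\cdot\beta(t)$ viewed as a function of w, y(t) and $\beta(t-1)$, a zero initial state, and the cost $\frac{1}{2}\sum_t(\alpha(t)-x(t))^2$ for a single realization. The names `forward`, `cost`, `gradient` and `max_gradient_error` are hypothetical.

```python
import numpy as np

def forward(W, u, c, y):
    """Run the recurrent layer; beta[0] is the zero initial state (assumption)."""
    beta = np.zeros((len(y) + 1, len(u)))
    for t in range(1, len(y) + 1):
        beta[t] = np.tanh(W @ beta[t - 1] + u * y[t - 1])
    alpha = beta[1:] @ c                      # alpha(t) = c . beta(t)
    return beta, alpha

def cost(W, u, c, y, x):
    """Half the summed squared error for one realization."""
    _, alpha = forward(W, u, c, y)
    return 0.5 * np.sum((alpha - x) ** 2)

def gradient(W, u, c, y, x):
    """Gradient of cost, sweeping backward in time over the ordered derivatives."""
    beta, alpha = forward(W, u, c, y)
    gW, gu, gc = np.zeros_like(W), np.zeros_like(u), np.zeros_like(c)
    delta = np.zeros_like(u)                  # ordered derivative w.r.t. beta(T): zero here
    for t in range(len(y), 0, -1):
        e = alpha[t - 1] - x[t - 1]           # dC/dalpha(t)
        g = (e * c + delta) * (1.0 - beta[t] ** 2)
        gW += np.outer(g, beta[t - 1])        # partials taken with beta(t-1) held fixed
        gu += g * y[t - 1]
        gc += e * beta[t]
        delta = W.T @ g                       # ordered derivative w.r.t. beta(t-1)
    return gW, gu, gc

def max_gradient_error(seed=0, q=3, T=5, eps=1e-6):
    """Largest gap between the recursion and central finite differences."""
    rng = np.random.default_rng(seed)
    W = 0.5 * rng.normal(size=(q, q))
    u, c = rng.normal(size=q), rng.normal(size=q)
    y, x = rng.normal(size=T), rng.normal(size=T)
    grads = gradient(W, u, c, y, x)
    err = 0.0
    for param, grad in zip((W, u, c), grads):
        for idx in np.ndindex(param.shape):
            old = param[idx]
            param[idx] = old + eps
            up = cost(W, u, c, y, x)
            param[idx] = old - eps
            down = cost(W, u, c, y, x)
            param[idx] = old
            err = max(err, abs((up - down) / (2 * eps) - grad[idx]))
    return err
```

Calling `max_gradient_error()` compares the recursion with a brute-force finite-difference gradient. The recursion needs one backward sweep per realization, whereas finite differences need two cost evaluations per weight.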

2.4.  Numerical  Examples 

Four examples are worked out. The first two show that both a neural filter with fully-interconnected neurons (NFFN) and that with ring-connected neurons (NFRN) outperform the extended Kalman filter (EKF) and the iterated extended Kalman filter (IEKF) for nonlinear systems described by (1.1) and (1.2). It is also shown that the NFFNs converge rather fast, as the number of neurons increases, for the example systems. By the main theorem, they converge to the minimum variance filter. The third shows that the performance of even an NFRN is indistinguishable from that of the Kalman filter for a linear system. In the last example, neither the signals nor the measurements can be described by (1.1) and (1.2). Nevertheless, the neural filters of both types work satisfactorily.

In describing the signal/sensor systems in the following examples, the symbols w(t) and v(t) denote statistically independent standard white Gaussian sequences with mean E[w(t)] = E[v(t)] = 0 and variance E[w(t)^2] = E[v(t)^2] = 1. The training data for each example are obtained by simulating the system and consist of batches of 200 realizations of 100 consecutive measurements and signals. A batch is used in a training session of 100 epochs (or sweeps). If the convergence of the weights is not reached, a different batch is used in another training session. Continuing in this manner, we stop the training whenever the convergence is reached, i.e. the change in the mean square error C(w) is less than 10^{-10} C(w).

The training method of first applying TDRBP to evaluate the gradient dC/dw and then optimizing C by the conjugate gradient method was effective, albeit slow. A 486-based personal computer was used for training. On average, it took about 16 hours to train a neural filter on the PC. Local minima of the error function did not pose a serious problem. Only three training sessions resulted in unsatisfactory filters during our entire training experience. Restarting with a new set of random weights always resolved the problem.

After training, 500 Monte Carlo test runs were performed for the EKF, IEKF, NFFN, and/or NFRN that are considered in each example. The RMSE of the estimates $\hat{x}(t,\omega)$ produced by a filter at time t for the 500 Monte Carlo runs ω = 1, ..., 500 is defined by

$$\Big[\frac{1}{500}\sum_{\omega=1}^{500}\big(\hat{x}(t,\omega) - x(t,\omega)\big)^2\Big]^{1/2}.$$

Therefore  the  RMSE  for  a  filter  is  a  function  of  time.  The  RMSE  of  each  filter 
considered  versus  time  is  plotted  for  120  time  points.  In  the  last  example,  the 
true  signal  and  the  estimates  produced  by  NFFN  and  NFRN  for  a  single  run  are 
also  plotted  versus  time.  The  last  20  time  points  are  included  to  demonstrate  the 
generalization  ability  of  neural  filters. 
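
As a minimal sketch of the RMSE computation described above, the function below takes the estimates and the true signals as arrays with one row per Monte Carlo run and one column per time point. The function name and the array layout are assumptions made for illustration.

```python
import numpy as np

def rmse_per_time(estimates, signals):
    """RMSE at each time point; both arrays have shape (n_runs, n_times)."""
    estimates = np.asarray(estimates, dtype=float)
    signals = np.asarray(signals, dtype=float)
    # Average the squared error over the runs (axis 0), then take the root.
    return np.sqrt(np.mean((estimates - signals) ** 2, axis=0))
```

With 500 runs and 120 time points the result is a vector of 120 values, which is what is plotted against time in the figures.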

In  all  the  examples,  both  the  NFFN  and  NFRN  have  7  hidden  neurons.  Notice 
that  while  there  are  71  weights  in  the  NFFN,  there  are  only  36  weights  in  the 
NFRN. 

Example  1.  The  signal/sensor  system  is 

x(t+1) = 1.1 exp(-2 x^2(t)) - 1 + 0.5 w(t),

y(t) = x^(t) + 0.1 v(t),

where x(0) is Gaussian with mean -0.5 and variance 0.1^2. Note that x(t+1) = 1.1 exp(-2 x^2(t)) - 1 has a global attractor at x(t) = 0.0844.
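
The following sketch shows how one batch of training data for such a system might be simulated, and checks the attractor of the noise-free recursion numerically. The exponent in the measurement equation is not legible in this copy, so the sensor function is passed in as an argument; the cubic used in `assumed_sensor` is only a placeholder assumption. All function names are hypothetical.

```python
import numpy as np

def simulate(f, h, x0_mean, x0_std, w_gain, v_gain, n_runs=200, n_steps=100, seed=0):
    """Simulate x(t+1) = f(x(t)) + w_gain*w(t) and y(t) = h(x(t)) + v_gain*v(t)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(x0_mean, x0_std, size=n_runs)            # x(0), one per realization
    xs = np.empty((n_runs, n_steps))
    ys = np.empty((n_runs, n_steps))
    for t in range(n_steps):
        x = f(x) + w_gain * rng.standard_normal(n_runs)     # signal at the next time
        xs[:, t] = x
        ys[:, t] = h(x) + v_gain * rng.standard_normal(n_runs)
    return xs, ys

def example1_dynamics(x):
    return 1.1 * np.exp(-2.0 * x ** 2) - 1.0

def assumed_sensor(x):
    # ASSUMPTION: placeholder nonlinearity, not taken from the source.
    return x ** 3

def iterate_noise_free(f, x0, n_iter=200):
    """Iterate the noise-free recursion to locate its attractor."""
    x = x0
    for _ in range(n_iter):
        x = f(x)
    return x

xs, ys = simulate(example1_dynamics, assumed_sensor, -0.5, 0.1, 0.5, 0.1)
fixed_point = iterate_noise_free(example1_dynamics, -0.5)
```

The arrays `xs` and `ys` hold 200 realizations of 100 consecutive signals and measurements, matching the batch size stated earlier, and `fixed_point` settles near the quoted value 0.0844.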

The EKF fails badly. The RMSEs versus time for the IEKF, NFFN, and NFRN are shown in Fig. 2.4. They are plotted by different lines as specified in the caption of the figure. We note that the RMSEs for the NFFN and NFRN are virtually the same. In fact, the RMSEs of the NFRN are worse than those of the NFFN by only 0.1% over the 120 time points. The RMSEs for NFFNs with 2, 4, and 8 neurons are plotted versus time in Fig. 2.5. We notice that they converge rapidly.

Example  2.  The  signal/sensor  system  is 

x(t+1) = 1.7 exp(-2 x^2(t)) - 1 + 0.1 w(t),

y(t) = x^(t) + 0.1 v(t),

where x(0) is Gaussian with mean zero and variance 0.5^2. Note that x(t+1) = 1.7 exp(-2 x^2(t)) - 1 is a chaotic process.

Figure 2.4: RMSEs versus time of IEKF (···), NFFN (—) and NFRN (—) for Example 1. The RMSEs of them over 120 time points are 0.2806, 0.2120 and 0.2122, respectively.

Figure 2.5: RMSEs versus time of NFFNs with 2 neurons (···), 4 neurons (—), and 8 neurons (—) for Example 1. The RMSEs of them over 120 time points are 0.2181, 0.2121 and 0.2114, respectively.

Figure 2.6: RMSEs versus time of EKF (—), IEKF (···), NFFN (—), and NFRN (—) for Example 2. The RMSEs of them over 120 time points are 0.3373, 0.2778, 0.1779 and 0.1804, respectively.


The RMSEs versus time for the EKF, IEKF, NFFN and NFRN are shown in Fig. 2.6. We note that the RMSEs of the NFRN are slightly worse than those of the NFFN, by 1.42%. The RMSEs for NFFNs with 2, 4 and 8 neurons are plotted versus time in Fig. 2.7. Again they converge rapidly.

Example  3.  The  signal/sensor  system  is 

x(t+1) = 0.9 x(t) + 0.2 w(t),
y(t) = x(t) + v(t),

where x(0) is Gaussian with mean zero and variance 1^2.

The  RMSEs  versus  time  for  the  Kalman  filter  and  NFRN  are  shown  in  Fig.  2.8. 
We  note  that  the  two  lines  are  virtually  the  same. 
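
For reference, the Kalman filter for this scalar linear system can be sketched in a few lines. This is a generic textbook recursion, not code from the report; the function name is hypothetical, and the convention that the prior on x(0) is propagated one step before the first measurement is processed is an assumption.

```python
import numpy as np

def kalman_filter_scalar(y, a=0.9, q=0.2 ** 2, r=1.0, m0=0.0, p0=1.0):
    """Scalar Kalman filter for x(t+1) = a x(t) + noise, y(t) = x(t) + noise.

    q and r are the variances of the process and measurement noise terms.
    Returns the posterior means and variances after each measurement.
    """
    m, p = m0, p0
    means, variances = [], []
    for yt in y:
        m_pred = a * m                      # predict the next state
        p_pred = a * a * p + q
        k = p_pred / (p_pred + r)           # Kalman gain
        m = m_pred + k * (yt - m_pred)      # update with the measurement
        p = (1.0 - k) * p_pred
        means.append(m)
        variances.append(p)
    return np.array(means), np.array(variances)
```

The posterior variance does not depend on the data. For these parameters it settles at about 0.1217, i.e. a steady-state standard deviation of about 0.349, after a short transient that starts from the prior variance.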

Figure 2.7: RMSEs versus time of NFFNs with 2 neurons (···), 4 neurons (—), and 8 neurons (—) for Example 2. The RMSEs of them over 120 time points are 0.1897, 0.1788 and 0.1777, respectively.

Figure 2.8: RMSEs versus time of the Kalman filter (—) and NFRN (—) for Example 3. The RMSEs of them over 120 time points are 0.3549 and 0.3563, respectively.




F30602-91-C-0033  Maryland  Technology  Corporation 


Figure  2.9:  The  true  signal  ( — )  and  its  estimates  produced  by  NFFN  ( — )  and 
NFRN  (•  •  •)  versus  time  for  Example  4. 


Example  4.  The  signal/sensor  system  is 

x(t+1) = 0.5 x(t) + 0.5 tanh(x(t) + 0.5 w(t)),
y(t) = x(t) + 0.5 x^(t) v(t),

where x(0) is Gaussian with mean zero and variance 0.5^2. Note that neither the signal nor the measurement process can be transformed into (1.1) or (1.2).

The  true  signal  and  its  estimates  produced  by  the  NFFN  and  NFRN  for  a 
single  run  are  shown  in  Fig.  2.9.  The  RMSEs  versus  time  for  the  NFFN  and 
NFRN  are  plotted  in  Fig.  2.10.  We  note  that  the  RMSEs  of  the  NFRN  are  worse 
than  those  of  the  NFFN  by  0.84%. 

3. Optimal Filtering by Multilayer Perceptrons with Output Feedbacks

In this chapter, we will show how an MLP with output feedbacks can be synthesized into a good filter. The synthesis is performed through neural network training, in which the measurements y_t are used as the neural network inputs and the signals x_t as the desired outputs. If a system model such as (1.1) and (1.2) is available, the training data y_t and x_t are easily generated by computer simulation. Otherwise, experimental data can be used instead. The training data are all that we need to synthesize the filter.

A requirement here that is not needed in the modern theory of optimal filtering is that x_t and y_t stay in a compact set. This requirement is easy to justify in the real world. The model, (1.1) and (1.2), violates it. However, the model is only an idealization, whose x_t and y_t stray out of a compact region with arbitrarily small probability provided that the region is sufficiently large. Other than this requirement, no assumption such as the Markov property, the Gaussian distribution, or the linear dynamics is necessary.

The training process minimizes the mean square error between the network outputs $\hat{\psi}(x_t)$ and the desired outputs $\psi(x_t)$. Consequently, after adequate training, the estimates produced by the network are minimum-variance for the given network architecture. Furthermore, it will be proven that as the size of the network increases, the network converges to the minimum variance filter.

There  are  two  types  of  feedback,  namely  teacher-forced  feedbacks  and  free 
feedbacks.  From  the  optimization  point  of  view,  teacher  forcing  amounts  to  con¬ 
straints.  From  the  statistical  point  of  view,  free  feedbacks  are  allowed  to  record 
the  most  informative  statistics  at  each  time  instant  for  the  best  estimation  at  the 
next  time  instant.  Therefore,  a  free  feedback  is  always  better  than  a  teacher-forced 
feedback,  if  the  application  condition  allows  its  use.  This  point  is  borne  out  by 
our  limited  simulation  results. 

Four numerical examples are worked out and reported in this paper. Two of them show that the neural filters consistently outperform the extended Kalman filter and the iterated extended Kalman filter. The third shows that the performance of the neural filter is virtually indistinguishable from that of the Kalman filter for a linear system of (1.1) and (1.2). In the last example, neither the signal nor the measurement can be described by (1.1) or (1.2). Nevertheless, the neural filter works satisfactorily.


Figure  3.1:  A  typical  multilayer  perceptron  (MLP). 

3.1.  Multilayer  Perceptrons  as  Neural  Filters 

A typical multilayer perceptron (MLP) is illustrated in Fig. 3.1. It has two input nodes, three neurons in the first hidden layer, two neurons in the second hidden layer and three output nodes. Let the weighted sum in the ith neuron in the lth layer and the output of the same be denoted by $y_i^l$ and $x_i^l$, respectively. Thus $x_i^0$ and $y_i^L$ are the ith input and the ith output of an MLP with L - 1 hidden layers. The outputs of the MLP are obtained by the forward propagation through each layer:

$$y_i^l = \sum_j w_{ij}^l x_j^{l-1} + w_{i0}^l,$$

$$x_i^l = a(y_i^l),$$

for l = 1, ..., L, where a is the neuron activation function and $w_{ij}^l$ is the weight from the jth neuron in the (l - 1)th layer to the ith neuron in the lth layer.

Figure 3.2: The general feedback configuration considered in this chapter.
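
The layer-by-layer forward propagation can be sketched as follows. The function name, the use of one weight matrix and one bias vector per layer, the tanh activation, and the choice to return the weighted sums of the last layer as linear outputs are assumptions made for illustration.

```python
import numpy as np

def mlp_forward(inputs, weights, biases, activation=np.tanh):
    """Propagate an input vector through the layers of an MLP.

    weights[l] has shape (n_out, n_in) and biases[l] has shape (n_out,).
    Hidden layers apply the activation; the last layer returns its weighted
    sums unchanged (linear output nodes, an assumption of this sketch).
    """
    x = np.asarray(inputs, dtype=float)
    n_layers = len(weights)
    for layer, (W, b) in enumerate(zip(weights, biases), start=1):
        s = W @ x + b                                   # weighted sums of this layer
        x = activation(s) if layer < n_layers else s
    return x

# The 2-3-2-3 layout of Fig. 3.1 with random weights.
rng = np.random.default_rng(0)
sizes = [2, 3, 2, 3]
example_weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
example_biases = [rng.normal(size=n_out) for n_out in sizes[1:]]
example_output = mlp_forward([0.5, -0.5], example_weights, example_biases)
```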

The general feedback configuration considered in this paper is shown in Fig. 3.2. It has j + k output nodes, among which $\alpha_1(t+1),\dots,\alpha_k(t+1)$ are teacher-forced outputs and $\beta_1(t+1),\dots,\beta_j(t+1)$ are free outputs at time t + 1. The MLP is trained to have only the teacher-forced outputs mimic the desired outputs. All the free outputs and i of the teacher-forced outputs, say $\alpha_1(t+1),\dots,\alpha_i(t+1)$, are delayed by one unit of time before being fed back to the input nodes. In addition, the MLP has m input nodes on which the external inputs to the MLP, $\gamma_1(t+1),\dots,\gamma_m(t+1)$, are clamped.
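
A minimal sketch of this feedback configuration is given below. The feedforward map is passed in as a function `step` (hypothetical), whose input is the i + j delayed outputs followed by the m external inputs and whose output is the k teacher-forced outputs followed by the j free outputs. Starting the delayed values at zero is an assumption.

```python
import numpy as np

def run_with_feedback(step, external_inputs, k, j, i):
    """Drive a feedforward map over time with unit-delay output feedback.

    step: maps a vector of length i + j + m to a vector of length k + j
          (first k entries teacher-forced outputs, last j entries free outputs).
    Returns the teacher-forced outputs at every time, shape (n_times, k).
    """
    fed_back = np.zeros(i + j)                       # delayed outputs, zero at the start
    alphas = []
    for gamma in external_inputs:
        out = step(np.concatenate([fed_back, np.atleast_1d(gamma)]))
        alpha, beta = out[:k], out[k:k + j]
        alphas.append(alpha)
        fed_back = np.concatenate([alpha[:i], beta])  # fed back at the next time step
    return np.array(alphas)
```

With i = 0 only the free outputs are fed back, which is the configuration denoted NF(0, j) in the numerical examples of this chapter.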

As discussed in the preceding section, by either computer simulation or experiments, we may have available a finite set S of signal/measurement sequences, $(x(t,\omega), y(t,\omega))$, t = 1, ..., T, ω ∈ S. It is assumed that the set S is a random sample and adequately reflects the joint probability distributions of the signal and measurement processes x(t) and y(t).

Suppose that $\psi(x(t)) = [\psi_1(x(t)),\dots,\psi_k(x(t))]^T$ are what we want to estimate and are thus the desired outputs for the actual outputs $\alpha_1(t),\dots,\alpha_k(t)$. Using the m components of the measurement vector $y(t) = [y_1(t),\dots,y_m(t)]^T$ as the external inputs, $\gamma_1(t),\dots,\gamma_m(t)$, the training data consist of the I/O pairs $(y(t,\omega),\psi(x(t,\omega)))$, t = 1, ..., T, ω ∈ S.

The training is to minimize, by the variation of the MLP weights w, the mean square error

$$C(w) = \frac{1}{T(\#S)}\sum_{\omega\in S}\sum_{t=1}^{T}\sum_{i=1}^{k}\big(\alpha_i(t,\omega) - \psi_i(x(t,\omega))\big)^2, \qquad (3.1)$$

where #S is the number of elements in S and $\alpha_i(t,\omega)$ is the actual MLP output at time t corresponding to the input sequence $y(\tau,\omega)$, τ = 1, ..., t. We note here that $\alpha_i(t,\omega)$ is a function of $y(\tau,\omega)$, τ = 1, ..., t.
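
The training criterion can be computed from arrays of network outputs and desired outputs as sketched below. The normalization by T times the number of sequences follows the reading of (3.1) as a mean square error over sequences and time; the function name and the array layout are assumptions.

```python
import numpy as np

def mean_square_error(outputs, targets):
    """Mean square error over a set of sequences.

    outputs, targets: arrays of shape (n_sequences, T, k) holding the network
    outputs and the desired outputs for every sequence, time and component.
    """
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    n_sequences, T, _ = outputs.shape
    # Sum over sequences, times and components, then divide by T * (#S).
    return np.sum((outputs - targets) ** 2) / (T * n_sequences)
```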

A  good  training  method,  which  is  used  in  our  simulation,  is  first  to  evaluate 
the  gradient  of  the  above  error  function  with  respect  to  w  by  the  time-dependent 
recurrent  backpropagation  by  Werbos[27]  and  Pearlmutter[20]  and  then  to  use  it 
in  a  conjugate  gradient  algorithm  for  optimization. 

A network adequately trained by minimizing C(w) will be called a neural filter.

3.2.  Convergence  to  the  Minimum  Variance  Filter 

The collection of all k-dimensional random vectors $x = [x_1,\dots,x_k]^T$ is a Hilbert space $L_2$ with the norm $E[\|x\|^2] = E[\sum_{i=1}^{k} x_i^2]$, where E denotes the expectation or the average over the sample space Ω with the σ-field $\mathcal{A}$ of events and the probability measure P.

By simple calculation, we have for any Borel-measurable function f of $y^t := \{y(1),\dots,y(t)\}$,

$$\begin{aligned}
C &:= \frac{1}{T}\sum_{t=1}^{T} E\big[\|\psi(x(t)) - f(y^t)\|^2\big]\\
&= \frac{1}{T}\sum_{t=1}^{T}\Big\{E\big[\|\psi(x(t)) - E[\psi(x(t))\,|\,y^t]\|^2\big] + E\big[\|E[\psi(x(t))\,|\,y^t] - f(y^t)\|^2\big]\Big\}, \qquad (3.2)
\end{aligned}$$

which shows that the conditional expectation $E[\psi(x(t))\,|\,y^t]$ is the minimum variance estimate of $\psi(x(t))$ given $y^t$.
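
The decomposition of the mean square error into an irreducible part and an excess part can be checked numerically on a toy problem. The sketch below is an illustration only: it uses a scalar signal, a measurement quantized into four levels so that the conditional mean is just a group average, and an arbitrary competitor estimate. All names and the toy setup are assumptions.

```python
import numpy as np

def decomposition_terms(psi, y, f_of_y):
    """Return (total, irreducible, excess) mean square errors for a discrete y."""
    psi = np.asarray(psi, dtype=float)
    y = np.asarray(y)
    f_of_y = np.asarray(f_of_y, dtype=float)
    cond_mean = np.empty_like(psi)
    for value in np.unique(y):
        mask = y == value
        cond_mean[mask] = psi[mask].mean()       # empirical E[psi | y = value]
    total = np.mean((psi - f_of_y) ** 2)
    irreducible = np.mean((psi - cond_mean) ** 2)
    excess = np.mean((cond_mean - f_of_y) ** 2)
    return total, irreducible, excess

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
# Coarse measurement: noisy x quantized into the levels 0, 1, 2, 3.
y = np.digitize(x + 0.5 * rng.standard_normal(x.size), [-1.0, 0.0, 1.0])
total, irreducible, excess = decomposition_terms(x, y, 0.3 * y - 0.4)
```

On the sample, `total` equals `irreducible + excess` up to rounding, because the cross term vanishes within each measurement level. Only the excess term depends on the chosen estimate, which is why the conditional mean is the minimizer.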

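Under the assumption that averaging over S is indistinguishable from taking expectation over Ω, we see, by comparing C(w) in (3.1) and C in (3.2), that

$$C(w) = \frac{1}{T(\#S)}\sum_{\omega\in S}\sum_{t=1}^{T}\|\psi(x(t,\omega)) - E[\psi(x(t))\,|\,y^t(\omega)]\|^2 + \frac{1}{T(\#S)}\sum_{\omega\in S}\sum_{t=1}^{T}\|\alpha(t,\omega) - E[\psi(x(t))\,|\,y^t(\omega)]\|^2.$$

Training amounts to minimizing the second term above. The objective is, of course, to minimize $(1/T)\sum_{t=1}^{T}E[\|\alpha(t) - \psi(x(t))\|^2]$.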

Recall the neural filter architecture described in Sec. 3.1. We observe that adding a neuron to a hidden layer decreases $\min_w C(w)$ and so does adding a feedback, free or teacher-forced. We are now ready to state and prove the main theorem of this paper.

Theorem. Consider the n-dimensional random process x(t) and the m-dimensional random process y(t), t = 1, ..., T, defined on a probability space $(\Omega,\mathcal{A},P)$. Assume that the range $A := \{y(t,\omega)\,|\,t = 1,\dots,T,\ \omega\in\Omega\} \subset R^m$ is compact and ψ is an arbitrary k-dimensional Borel function of x(t) such that the second moments $E[\|\psi(x(t))\|^2]$, t = 1, ..., T, are finite. Let $\alpha(t)$ denote the k-dimensional output at time t of the neural filter which has taken the external inputs, y(1), ..., y(t), in the given order.

(a). Given ε > 0, there exists a neural filter of the architecture described in Sec. 3.1 such that

$$\frac{1}{T}\sum_{t=1}^{T} E\big[\|\alpha(t) - E[\psi(x(t))\,|\,y^t]\|^2\big] < \epsilon.$$

(b). If the MLP in the neural filter described in Sec. 3.1 has a single hidden layer with N neurons, i = 0 teacher-forced feedback nodes, and j = m(T - 1) free feedback nodes and $\alpha(t)$ is written as $\alpha(t;N)$ here to indicate its dependency on N, then

$$r(N) := \min_{w}\frac{1}{T}\sum_{t=1}^{T} E\big[\|\alpha(t;N) - E[\psi(x(t))\,|\,y^t]\|^2\big] \qquad (3.3)$$

is monotone decreasing and converges to 0 as N approaches infinity.

Proof. Since (a) is an immediate consequence of (b), we shall only prove (b). Since r(N), N = 1, 2, ..., is monotone decreasing and bounded from below by 0, the sequence r(N) converges, and we denote the limit by r(∞). To prove that r(∞) = 0, it suffices to show that for any ε > 0, there is an integer M such that |r(M)| < ε. This will be shown in the following for the case in which the measurement process y(t) is scalar-valued, i.e. m = 1. The proof for m > 1 is similar and omitted.

As illustrated in Fig. 3.3, the T - 1 free feedback loops can be so constructed that for τ = 1, ..., T - 1,

$$\beta_\tau(t+1) = a(\beta_{\tau-1}(t)), \quad t = 0,\dots,T-1. \qquad (3.4)$$

The free feedback output nodes are initialized by $\beta_\tau(0) = 0$, τ = 1, ..., T - 1. At time t + 1, the free feedback input nodes are

$$(\beta_{T-1}(t),\dots,\beta_{t+1}(t),\beta_t(t),\beta_{t-1}(t),\dots,\beta_1(t)) = (0,\dots,0,a^{\circ t}(y(1)),a^{\circ(t-1)}(y(2)),\dots,a(y(t))),$$

where $a^{\circ t}$ denotes the t-fold composition a ∘ a ∘ ··· ∘ a of a. Including both the feedback input nodes and the external input node, we have the T-dimensional MLP input vector at time t + 1,

$$z(t+1) = [\beta_{T-1}(t),\dots,\beta_1(t),y(t+1)]^T.$$
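
The delay-line behaviour of these feedback loops can be sketched as follows. The reading used here, that the first loop applies a to the current measurement and every other loop applies a to the previous value of its predecessor, is an interpretation chosen to reproduce the vector of compositions stated above; the function name and the use of tanh for a are assumptions.

```python
import numpy as np

def feedback_inputs(measurements, T, a=np.tanh):
    """Return [beta_{T-1}(t), ..., beta_1(t)] after y(1), ..., y(t) are processed."""
    beta = np.zeros(T - 1)                    # beta[tau - 1] holds beta_tau, initially 0
    for y_now in measurements:
        shifted = np.empty_like(beta)
        shifted[0] = a(y_now)                 # beta_1 takes the current measurement
        shifted[1:] = a(beta[:-1])            # beta_tau(t+1) = a(beta_{tau-1}(t))
        beta = shifted
    return beta[::-1]
```

For T = 4 and two measurements the result is (0, a∘a(y(1)), a(y(2))). Because a is strictly monotone and a(0) = 0 for tanh, every past measurement can be recovered from this vector, which is what the proof relies on.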


Recall that

$$\hat{\psi}(t+1) := E[\psi(x(t+1))\,|\,a^{\circ t}(y(1)),a^{\circ(t-1)}(y(2)),\dots,a(y(t)),y(t+1)]$$

is a Borel measurable function of $a^{\circ t}(y(1)),\dots,a(y(t))$, and y(t + 1). Let us denote it by $\hat{\psi}(t+1) = \lambda_{t+1}(a^{\circ t}(y(1)),\dots,a(y(t)),y(t+1))$. Now let us consider the mapping $f : \xi \mapsto \eta = f(\xi)$, with $\xi = (\xi_1,\dots,\xi_T)$, defined by

$$f(\xi_1,\dots,\xi_T) = \begin{cases}
\lambda_1(\xi_T), & \text{if } \xi_1 = \cdots = \xi_{T-1} = 0,\ \xi_T \ne 0 \text{ and } \lambda_1(\xi_T) \text{ is defined;}\\
\lambda_2(\xi_{T-1},\xi_T), & \text{if } \xi_1 = \cdots = \xi_{T-2} = 0,\ \xi_{T-1} \ne 0,\ \xi_T \ne 0 \text{ and } \lambda_2(\xi_{T-1},\xi_T) \text{ is defined;}\\
\quad\vdots\\
\lambda_T(\xi_1,\dots,\xi_T), & \text{if } \xi_1 \ne 0,\dots,\xi_T \ne 0 \text{ and } \lambda_T(\xi_1,\dots,\xi_T) \text{ is defined;}\\
0, & \text{otherwise.}
\end{cases}$$

Figure 3.3: A special feedback configuration used in the proof of the main theorem.

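Let us consider also the measures $\mu_t$ defined on the Borel sets in $R^T$ by

$$\mu_1(d\xi_1\times d\xi_2\times\cdots\times d\xi_T) = \begin{cases} P(\omega\,|\,y(1)\in d\xi_T), & \text{if } 0\in d\xi_t \text{ for } t = 1,\dots,T-1,\\ 0, & \text{otherwise;}\end{cases}$$

$$\mu_2(d\xi_1\times d\xi_2\times\cdots\times d\xi_T) = \begin{cases} P(\omega\,|\,a(y(1))\in d\xi_{T-1},\ y(2)\in d\xi_T), & \text{if } 0\in d\xi_t \text{ for } t = 1,\dots,T-2,\\ 0, & \text{otherwise;}\end{cases}$$

$$\vdots$$

$$\mu_T(d\xi_1\times d\xi_2\times\cdots\times d\xi_T) = P(\omega\,|\,a^{\circ(T-1)}(y(1))\in d\xi_1,\dots,a(y(T-1))\in d\xi_{T-1},\ y(T)\in d\xi_T).$$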

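Notice that $\mu = \frac{1}{T}(\mu_1 + \cdots + \mu_T)$ is also a measure on the Borel sets in $R^T$ and that

$$\begin{aligned}
&\int \|f(\xi_1,\dots,\xi_T)\|^2\,\mu(d\xi_1\times d\xi_2\times\cdots\times d\xi_T)\\
&\quad= \frac{1}{T}\Big\{\int \|\lambda_1(\xi_T)\|^2\,\mu_1(d\xi_1\times d\xi_2\times\cdots\times d\xi_T)\\
&\qquad+ \int \|\lambda_2(\xi_{T-1},\xi_T)\|^2\,\mu_2(d\xi_1\times d\xi_2\times\cdots\times d\xi_T)\\
&\qquad+ \cdots\\
&\qquad+ \int \|\lambda_T(\xi_1,\dots,\xi_T)\|^2\,\mu_T(d\xi_1\times d\xi_2\times\cdots\times d\xi_T)\Big\}\\
&\quad= \frac{1}{T}\Big\{E\big[\|E[\psi(x(1))\,|\,y(1)]\|^2\big] + E\big[\|E[\psi(x(2))\,|\,a(y(1)),y(2)]\|^2\big]\\
&\qquad+ \cdots + E\big[\|E[\psi(x(T))\,|\,a^{\circ(T-1)}(y(1)),a^{\circ(T-2)}(y(2)),\dots,y(T)]\|^2\big]\Big\}\\
&\quad\le \frac{1}{T}\Big\{E\big[E[\|\psi(x(1))\|^2\,|\,y(1)]\big] + E\big[E[\|\psi(x(2))\|^2\,|\,a(y(1)),y(2)]\big]\\
&\qquad+ \cdots + E\big[E[\|\psi(x(T))\|^2\,|\,a^{\circ(T-1)}(y(1)),a^{\circ(T-2)}(y(2)),\dots,y(T)]\big]\Big\}\\
&\quad= \frac{1}{T}\Big\{E[\|\psi(x(1))\|^2] + E[\|\psi(x(2))\|^2] + \cdots + E[\|\psi(x(T))\|^2]\Big\}\\
&\quad< \infty.
\end{aligned}$$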

Since y(t), t = 1, ..., T, are assumed to have compact ranges, it is easy to see that $a^{\circ t}(y(\tau))$, t and τ = 1, ..., T, all have compact ranges. Hence there is a compact set W in $R^T$ such that μ(W) = 1.

Consider now the normed space $L_2(R^T,\mu)$ with the norm defined by $[\int \|g(\xi_1,\dots,\xi_T)\|^2\,\mu(d\xi_1\times\cdots\times d\xi_T)]^{1/2}$ for each k-dimensional function $g \in L_2(R^T,\mu)$. By Corollary 2.2 of the paper [9] by Hornik, Stinchcombe and White, the k-dimensional functions represented by all the MLPs with k output nodes and a single layer of hidden neurons are dense in $L_2(R^T,\mu)$. It then follows that for any ε > 0, there is such an MLP that the function $g(\xi_1,\dots,\xi_T)$ that it represents satisfies

$$\int \|f(\xi_1,\dots,\xi_T) - g(\xi_1,\dots,\xi_T)\|^2\,\mu(d\xi_1\times\cdots\times d\xi_T) < \epsilon.$$

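Let us translate the left hand side of this inequality from $L_2(R^T,\mu)$ back to the probability space $(\Omega,\mathcal{A},P)$:

$$\begin{aligned}
&\int \|f(\xi_1,\dots,\xi_T) - g(\xi_1,\dots,\xi_T)\|^2\,\mu(d\xi_1\times d\xi_2\times\cdots\times d\xi_T)\\
&\quad= \frac{1}{T}\Big\{\int \|\lambda_1(\xi_T) - g(0,\dots,0,\xi_T)\|^2\,\mu_1(d\xi_1\times d\xi_2\times\cdots\times d\xi_T)\\
&\qquad+ \int \|\lambda_2(\xi_{T-1},\xi_T) - g(0,\dots,0,\xi_{T-1},\xi_T)\|^2\,\mu_2(d\xi_1\times d\xi_2\times\cdots\times d\xi_T)\\
&\qquad+ \cdots\\
&\qquad+ \int \|\lambda_T(\xi_1,\dots,\xi_T) - g(\xi_1,\dots,\xi_T)\|^2\,\mu_T(d\xi_1\times d\xi_2\times\cdots\times d\xi_T)\Big\}\\
&\quad= \frac{1}{T}\Big\{E\big[\|E[\psi(x(1))\,|\,y(1)] - g(0,\dots,0,y(1))\|^2\big]\\
&\qquad+ E\big[\|E[\psi(x(2))\,|\,a(y(1)),y(2)] - g(0,\dots,0,a(y(1)),y(2))\|^2\big]\\
&\qquad+ \cdots\\
&\qquad+ E\big[\|E[\psi(x(T))\,|\,a^{\circ(T-1)}(y(1)),a^{\circ(T-2)}(y(2)),\dots,y(T)] - g(a^{\circ(T-1)}(y(1)),\dots,y(T))\|^2\big]\Big\}\\
&\quad= \frac{1}{T}\sum_{t=1}^{T} E\big[\|E[\psi(x(t))\,|\,y^t] - g(0,\dots,0,a^{\circ(t-1)}(y(1)),a^{\circ(t-2)}(y(2)),\dots,y(t))\|^2\big],
\end{aligned}$$

where the last equality holds, because the σ-fields generated by $y^t$ and $\{a^{\circ(t-1)}(y(1)), a^{\circ(t-2)}(y(2)),\dots,y(t)\}$ are identical due to the monotonicity of a.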

We notice that the foregoing approximation network $g(\xi_1,\dots,\xi_T)$ and the feedback network (3.4) discussed earlier form an MLP with a single hidden layer. The number N of hidden neurons of this MLP is the sum of those of $g(\xi_1,\dots,\xi_T)$ and of (3.4). Recalling (3.3) we have

$$r(N) \le \frac{1}{T}\sum_{t=1}^{T} E\big[\|E[\psi(x(t))\,|\,y^t] - g(0,\dots,0,a^{\circ(t-1)}(y(1)),a^{\circ(t-2)}(y(2)),\dots,y(t))\|^2\big] < \epsilon,$$

since r(N) minimizes the error function in (3.3) by the variation of the weights of the MLP of exactly the same architecture as that for g. This completes the proof. □


In real-world application of a filter, the time interval in which the filter is applied is usually very long, i.e. T ≫ 1. Having T - 1 feedback loops is impractical. It is perhaps as impractical as having infinitely many neurons. Nevertheless, the above theorem guarantees that one can get arbitrarily close to the minimum variance filter, provided that sufficient numbers of feedbacks and neurons are used.

In fact, the number of feedbacks needed depends very much on the properties and the probability distributions of the signal and measurement processes, x(t) and y(t), individually and jointly. After adequate training, the feedbacks actually carry the "most informative" a priori statistics at each time instant for producing the "most accurate" a posteriori estimates required at the next time instant. For instance, if x(t) and y(t) satisfy (1.1) and (1.2) with f and h being linear in x(t), G constant, and x(0), v(t) and w(t) Gaussian, the feedbacks needed for minimum variance filtering are the conditional mean and covariances of x(t) given $y^t$, regardless of T.

3.3.  Numerical  Examples 

Four examples are worked out. The first two show that neural filters with various combinations of free and teacher-forced feedbacks outperform the extended Kalman filter (EKF) and the iterated extended Kalman filter (IEKF) for nonlinear systems described by (1.1) and (1.2). It is also shown that free feedbacks are better than teacher-forced feedbacks and the neural filters with only free feedbacks converge rather fast for the example systems. By the main theorem, they converge to the minimum variance filter. The third shows that the performance of a neural filter is indistinguishable from that of the Kalman filter for a linear system. In the last example, neither the signals nor the measurements can be described by (1.1) and (1.2). Nevertheless, the neural filters work satisfactorily.

In describing the signal/sensor systems in the following examples, the symbols w(t) and v(t) denote statistically independent standard white Gaussian sequences with mean E[w(t)] = E[v(t)] = 0 and variance E[w(t)^2] = E[v(t)^2] = 1. The training data for each example are obtained by simulating the system and consist of batches of 200 realizations of 100 consecutive measurements and signals. A batch is used in a training session of 100 epochs (or sweeps). If the convergence of the weights is not satisfactory, a different batch is used in another training session. Continuing in this manner, we stop the training whenever the convergence is reached, i.e. the change in the mean square error C(w) is less than 10^{-10} C(w).

In  training,  we  first  calculate  the  ordered  partial  derivatives,  d'^C f dyi{t,u))^ 
d'^C and  d'^C /dipi{x{t,u)),  by  the  time-dependent  recurrent  backpropa- 
gation  (TDRBP)  derived  by  Werbos[27]  and  Pearlmutter[20].  Then  the  gradient, 
dC/dw,  is  evaluated  and  used  in  a  conjugate  gradient  method  for  minimizing 
C{w). 

After  training,  500  Monte  Carlo  test  runs  were  performed  for  each  filter  that 
is  considered  in  each  example.  The  RMSE  of  each  filter  considered  versus  time  is 
plotted  for  120  time  points.  In  the  last  example,  the  true  signal  and  the  estimate 
produced  by  the  neural  filter  in  a  single  run  are  also  plotted  versus  time.  The  last 
20  time  points  are  included  to  demonstrate  the  generalization  ability  of  neural 
filters. 

In  the  following,  we  shall  denote  a  neural  filter  with  m  teacher-forced  feedbacks 
and  n  free  feedbacks  by  NF(m,n). 

Example  1.  The  signal/sensor  system  is 

x(t+1) = 1.1 exp(-2 x^2(t)) - 1 + 0.5 w(t),
y(t) = x^(t) + 0.1 v(t),

where x(0) is Gaussian with mean -0.5 and variance 0.1^2. Note that x(t+1) = 1.1 exp(-2 x^2(t)) - 1 has a global attractor at x(t) = 0.0844.

The EKF fails badly. The RMSEs versus time for the neural filters with 1 feedback, 2 feedbacks and 3 feedbacks are plotted in Figs. 3.4 - 3.7, respectively. All possible combinations of the teacher-forced feedbacks are included for each number of feedbacks. The results show that the more free feedbacks there are in the combination, the lower the RMSE is for a fixed total number of feedbacks. Figure 3.7 shows that the performance of the neural filter with only free feedbacks converges as the number of feedbacks increases.

Figure 3.4: RMSEs versus time of IEKF (···), NF(1,0) (—) and NF(0,1) (—) for Example 1. The RMSEs of them over 100 time points are 0.2800, 0.2200 and 0.2106, respectively.

Figure 3.5: RMSEs versus time of NF(2,0), NF(1,1) (—) and NF(0,2) (—) for Example 1. The RMSEs of them over 100 time points are 0.2292, 0.2112 and 0.2103, respectively.

Figure 3.6: RMSEs versus time of NF(3,0) (-·-), NF(2,1) (···), NF(1,2) (--) and NF(0,3) (—) for Example 1. The RMSEs of them over 100 time points are 0.2369, 0.2124, 0.2121 and 0.2096, respectively.

Figure 3.8: RMSEs versus time of EKF (-·-), IEKF (···), NF(1,0) (—) and NF(0,1) (—) for Example 2. The RMSEs of them over 100 time points are 0.3371, 0.2807, 0.1942 and 0.1801, respectively.

Example  2.  The  signal/sensor  system  is 

x(t+1) = 1.7 exp(-2 x^2(t)) - 1 + 0.1 w(t),
y(t) = x^(t) + 0.1 v(t),

where x(0) is Gaussian with mean zero and variance 0.5^2. Note that x(t+1) = 1.7 exp(-2 x^2(t)) - 1 is a chaotic process.

The RMSEs versus time for the neural filters with 1 feedback, 2 feedbacks and 3 feedbacks are plotted in Figs. 3.8 - 3.11, respectively. All possible combinations of the teacher-forced feedbacks are included for each number of feedbacks. The results show that the more free feedbacks there are in the combination, the lower the RMSE is for a fixed total number of feedbacks. Figure 3.11 shows that the performance of the neural filter with only free feedbacks converges as the number of feedbacks increases.

Figure 3.9: RMSEs versus time of NF(2,0) (···), NF(1,1) (—) and NF(0,2) (—) for Example 2. The RMSEs of them over 100 time points are 0.1813, 0.1768 and 0.1766, respectively.

Figure 3.10: RMSEs versus time of NF(3,0) (-·-), NF(2,1) (···), NF(1,2) (—) and NF(0,3) (—) for Example 2. The RMSEs of them over 100 time points are 0.1836, 0.1833, 0.1774 and 0.1764, respectively.

Figure 3.11: RMSEs versus time of NF(0,1) (···), NF(0,2) (—) and NF(0,3) (—) for Example 2. The RMSEs of them over 120 time points are 0.1786, 0.1767 and 0.1763, respectively.

Figure 3.12: RMSEs versus time of the Kalman filter (—) and the NF(0,2) (—) for Example 3. The RMSEs of them over 120 time points are 0.3549 and 0.3555, respectively.

Example 3. The signal/sensor system is

$$x(t+1) = 0.9\,x(t) + 0.2\,w(t),$$
$$y(t) = x(t) + v(t),$$

where $x(0)$ is Gaussian with mean zero and variance $1^2$.

The  RMSEs  versus  time  for  the  Kalman  filter  and  NF(0,2)  are  shown  in 
Fig.  3.12.  We  note  that  the  two  lines  are  virtually  the  same. 
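For a linear-Gaussian model such as Example 3, the error variance of the Kalman filter can be computed without simulation from the scalar Riccati recursion. The following is a minimal Python sketch of that recursion. It assumes, which the passage above does not state, that $w(t)$ and $v(t)$ are independent standard white Gaussian sequences; the function name and its default arguments are illustrative only.

```python
import math


def kalman_error_variances(a=0.9, g=0.2, q=1.0, r=1.0, p0=1.0, steps=120):
    """Posterior error variances P(t|t), t = 1..steps, for the scalar model
    x(t+1) = a*x(t) + g*w(t), y(t) = x(t) + v(t).

    q and r are the variances of w(t) and v(t); unit values are an assumption.
    """
    variances = []
    p = p0  # variance of the initial state x(0)
    for _ in range(steps):
        p_pred = a * a * p + g * g * q   # time update of the error variance
        gain = p_pred / (p_pred + r)     # Kalman gain for h(x) = x
        p = (1.0 - gain) * p_pred        # measurement update
        variances.append(p)
    return variances


variances = kalman_error_variances()
steady_state_rmse = math.sqrt(variances[-1])
```

Under these assumptions the steady-state RMSE comes out at roughly 0.349, which is of the same order as the time-averaged value reported for the Kalman filter in the caption of Fig. 3.12.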
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Figure  3.13:  The  true  signal  (— )  and  the  estimate  (— )  produced  by  the  NF(0,3) 
versus  time  for  Example  4. 


Example  4.  The  signal/sensor  system  is 

$$x(t+1) = 0.5\,x(t) + 0.5\tanh\bigl(x(t) + 0.5\,w(t)\bigr),$$

y{t)  =  x{t)  +  ^.^x^{t)v{t), 

where  x(0)  is  Gaussian  with  mean  zero  and  variance  0.5^.  Note  that  neither  the 
signal  nor  the  measurement  process  can  be  transformed  into  (1.1)  or  (1.2). 

The true signal and its estimate produced by the NF(0,3) in a single run are shown in Fig. 3.13. The RMSEs versus time for the NF(0,1), NF(0,2), and NF(0,3) are plotted in Fig. 3.14, which shows that the performance of the neural filter with only free feedbacks converges as the number of feedbacks increases.
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Figure  3.14:  RMSEs  versus  time  of  NF(0,1)  (•  •  •),  NF(0,2)  (— )  and  NF(0,3)  (— ) 
for  Example  4.  The  RMSEs  of  them  over  120  time  points  are  0.0568,  0.0567  and 
0.0565,  respectively. 
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4.  Range  Extenders  and  Reducers  for  Neural  Filtering 

A fundamental requirement in the neural network approach to optimal filtering is that the measurement process stay in a bounded region. In theory, the requirement is always fulfilled, since every measurable quantity in the real world can be contained in a sufficiently large bounded region. However, if the measurement process or the signal process or both keep growing, as in a typical filtering problem in satellite orbit determination, aircraft/ship navigation, or target tracking, then the recurrent neural network (RNN) and the training data set must both be large for the RNN to have a valid domain that covers the range of measurements and a valid range that covers the range of signals. The larger the RNN and the training data set are, the more difficult it is to train the RNN on the training data set.

Furthermore, the time period or periods over which the training data is collected, by computer simulation or actual experiment, must be of finite length. If the measurement and signal processes keep growing, the RNN trained on the training data has difficulty generalizing beyond the foregoing time period or periods.

A simple way to extend the RNN output range and to reduce the exogenous input data range is scaling. We may multiply the RNN outputs by a constant greater than one and/or divide the exogenous inputs by another constant also greater than one. Alternatively, we may use the inverse of a bounded and monotone increasing function to extend (or antisquash) the RNN outputs and/or use another bounded and monotone increasing function to reduce (or squash) the exogenous inputs. However, scaling only suppresses or expands all the "variability." It does not change the "variability." Consequently, its usefulness is limited, as borne out in our numerical experiments.
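The squashing and antisquashing described above can be sketched as follows. The choice of the hyperbolic tangent as the bounded monotone increasing function, the scale constant, and the function names are all assumptions made for illustration; the text does not say which function was used in the experiments.

```python
import math


def squash(value, scale=100.0):
    """Map an unbounded exogenous input into the interval (-1, 1)."""
    return math.tanh(value / scale)


def antisquash(output, scale=100.0):
    """Inverse map: stretch an RNN output in (-1, 1) back to the signal range."""
    return scale * math.atanh(output)


# Two large inputs that differ by 50 end up very close together after squashing.
gap_after_squashing = squash(300.0) - squash(250.0)
```

In this sketch the inputs 250 and 300 are mapped to values that differ by less than 0.01, which is one way to see why a fixed squashing function may be of limited use when the process keeps growing.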

Our idea for extending the RNN output range is to add some estimate of each desired output to its corresponding RNN output, so that the RNN output is intended only to approximate the difference between the desired output and the estimate of it. The estimate does not have to be very good. As long as the range of the differences is an acceptable range for the RNN outputs, the purpose of the estimate is served. A device that generates such an estimate of each desired output and adds it to the corresponding RNN output is called a range extender.

Our idea for reducing the exogenous input data range is to subtract some estimate of each exogenous input from the exogenous input, so that the actual input to the RNN is the difference between the exogenous input and its estimate. The estimate does not have to be very good. As long as the range of the differences is an acceptable domain for the RNN, the purpose of the estimate is served. A device that generates such an estimate of each exogenous input and subtracts the estimate from the exogenous input is called a range reducer.

It is necessary to stress here that the foregoing estimate of a desired output and the foregoing estimate of an exogenous input are not arbitrary. They have to be so generated that the system consisting of the RNN together with the range extender and/or range reducer will have a filtering performance surpassing that of the RNN alone. Such a system will be called a neural filter.

We will herein disclose two range extenders and one range reducer. With the aid of one of the two range extenders and/or the range reducer, the output range and/or the input range of an RNN can be kept so small, and the "variability" in the inputs and/or outputs of the RNN so low, that the neural filter resulting from training the entire system jointly greatly outperforms the same RNN, aided or not by scaling methods, when the signal and/or measurement processes keep growing in time.

The first range extender herein disclosed uses an extended Kalman filter to provide an estimate of each desired output, which is added to the corresponding RNN output to extend the RNN outputs. Since a standard mathematical/statistical model of the signal and measurement processes is needed for the extended Kalman filter, the first range extender cannot be applied where such a mathematical/statistical model is unavailable.

The second range extender herein disclosed can be applied whether a mathematical/statistical model for the signal and measurement processes is available or not. The second range extender is actually an accumulator that accumulates the outputs of an RNN up to and including the current time point and presents the accumulated value as an estimate of the desired output at the current time point. This amounts to using the preceding estimate presented by the accumulator as an estimate of the current desired output, which is added to the current RNN output to extend the RNN output range.

The range reducer herein disclosed is a differencer that evaluates the difference between a current exogenous input and the preceding exogenous input. Here the preceding exogenous input is regarded as an estimate of the current exogenous input, which is subtracted from the current exogenous input to reduce the range of the exogenous inputs. This range reducer can also be applied whether a mathematical/statistical model is available or not.
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4.1.  A  Range  Extender  by  Kalman  Filtering 

Assume that the following standard model for the signal process $x(t)$ and measurement process $y(t)$ is given:

$$x(t+1) = f(x(t),t) + G(x(t),t)\,\xi(t), \quad x(1) = x_1, \qquad (4.1)$$

$$y(t) = h(x(t),t) + \varepsilon(t), \qquad (4.2)$$

where $x(t)$ is an $n$-dimensional stochastic process; $y(t)$ is an $m$-dimensional stochastic process; $x_1$ is a Gaussian random vector; $\xi(t)$ and $\varepsilon(t)$ are respectively $n_1$-dimensional and $m_1$-dimensional Gaussian noise processes with zero means; $x_1$, $\xi(t)$ and $\varepsilon(t)$ have given joint probability distributions; and $f(x,t)$, $G(x,t)$ and $h(x,t)$ are known functions with such appropriate dimensions and properties that (4.1) and (4.2) describe faithfully the evolutions of the signal and measurement.

Consider the neural filter depicted in Fig. 4.1. The extended Kalman filter (EKF) and the adder constitute a range extender. The EKF uses the estimate $\hat{x}(t-1)$ from the RNN, the error covariance matrix $\Sigma(t-1)$ from itself, and the measurement vector $y(t)$ to produce an estimate $\tilde{x}(t)$ of $x(t)$. The RNN is then employed to estimate only the difference $x(t) - \tilde{x}(t)$, whose range is expected to be much smaller than that of $x(t)$ if $\tilde{x}(t)$ is a good estimate of $x(t)$. Denoting the output of the RNN by $\beta^L(t)$, the estimate $\hat{x}(t)$ of $x(t)$ generated by the neural filter is

$$\hat{x}(t) = \tilde{x}(t) + \beta^L(t).$$

The EKF equations are

$$\Sigma(t|t-1) = F(t-1)\Sigma(t-1)F^T(t-1) + G(\hat{x}(t-1),t-1)\,Q(t-1)\,G^T(\hat{x}(t-1),t-1), \qquad (4.3)$$

$$\tilde{x}(t|t-1) = f(\hat{x}(t-1),t-1), \qquad (4.4)$$

$$\Omega(t) = H(t)\Sigma(t|t-1)H^T(t) + R(t), \qquad (4.5)$$

$$L(t) = \Sigma(t|t-1)H^T(t)\,\Omega^{-1}(t), \qquad (4.6)$$

$$\tilde{x}(t) = \tilde{x}(t|t-1) + L(t)\,[y(t) - h(\tilde{x}(t|t-1),t)], \qquad (4.7)$$

$$\Sigma(t) = \Sigma(t|t-1) - L(t)H(t)\Sigma(t|t-1), \qquad (4.8)$$

where $F(t-1) = (\partial f(x,t-1)/\partial x)|_{x=\hat{x}(t-1)}$ and $H(t) = (\partial h(x,t)/\partial x)|_{x=\tilde{x}(t|t-1)}$. The training of the RNN has to take into consideration the feedback of $\hat{x}(t-1)$ to the EKF. A training algorithm is provided in Sec. V for training a multilayer perceptron with interconnected neurons (MLPWIN) as the RNN in the neural filter. A numerical example using an EKF and an adder as a range extender is given in Sec. IV.
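One time step of the range extender of Fig. 4.1 can be sketched for a scalar signal as follows. The sketch assumes the standard EKF recursion for (4.3) to (4.8) and time-invariant model functions; the function names and the argument list are hypothetical, and the RNN output is simply passed in as a number rather than computed.

```python
def ekf_range_extender_step(x_prev, sigma_prev, y, f, df, g, h, dh, q, r):
    """One scalar EKF step driven by the fed-back neural filter estimate x_prev.

    Returns (x_ekf, sigma): the EKF estimate of x(t) and its error variance.
    f, g, h are the model functions; df and dh are the derivatives of f and h.
    """
    F = df(x_prev)                               # Jacobian of f at the fed-back estimate
    G = g(x_prev)
    sigma_pred = F * sigma_prev * F + G * q * G  # predicted error variance, cf. (4.3)
    x_pred = f(x_prev)                           # predicted signal, cf. (4.4)
    H = dh(x_pred)                               # Jacobian of h at the prediction
    omega = H * sigma_pred * H + r               # innovation variance, cf. (4.5)
    gain = sigma_pred * H / omega                # filter gain, cf. (4.6)
    x_ekf = x_pred + gain * (y - h(x_pred))      # measurement update, cf. (4.7)
    sigma = sigma_pred - gain * H * sigma_pred   # updated error variance, cf. (4.8)
    return x_ekf, sigma


def neural_filter_output(x_ekf, rnn_output):
    """Adder: the RNN only supplies a correction to the EKF estimate."""
    return x_ekf + rnn_output
```

The value returned by `neural_filter_output` would be fed back as `x_prev` at the next time step, which is why the training of the RNN has to account for this feedback path.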
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Figure  4.1:  Neural  filter  with  an  EKF 
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4.2.  An  Accumulator  as  a  Range  Extender 

If the signal process $x(t)$ consists of the vector values, at discrete time points, of a continuous continuous-time process, then the vector value $x(t-1)$ and a good estimate of it are reasonably good estimates of the vector value $x(t)$. This observation motivated a simple yet effective way to extend the output range of an RNN in a neural filter when two consecutive signals, $x(t-1)$ and $x(t)$, are not too far apart.

Consider the neural filter depicted in Fig. 4.2. An accumulator is concatenated at the output terminals of an RNN. At each time point $t$, the accumulator adds the output vector $\beta^L(t)$ of the RNN to the accumulator's output vector $\hat{x}(t-1)$ at the preceding time point $t-1$. Thus the accumulator accumulates all the output vectors of the RNN from $t = 1$ onward plus the initial accumulation denoted by $\hat{x}(0)$. Mathematically, the accumulator is described simply by

$$\hat{x}(t) = \beta^L(t) + \hat{x}(t-1), \quad t = 1, 2, \ldots, T. \qquad (4.9)$$

Here, the RNN actually estimates the difference $x(t) - \hat{x}(t-1)$, which is expected to have a much smaller range than does $x(t)$. If a good a priori estimate $\bar{x}(0)$ is given of $x(0)$, it should be used as the initial accumulation $\hat{x}(0)$. Otherwise, the initial accumulation $\hat{x}(0)$ can be determined together with the weights and/or parameters $w$ and the initial dynamic state $v$ of the RNN in minimizing a training criterion for the neural filter. A training algorithm is provided in Sec. V for training a multilayer perceptron with interconnected neurons (MLPWIN) as the RNN in the neural filter.
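The accumulator of (4.9) amounts to a running sum started from the initial accumulation. A minimal Python sketch for a scalar signal follows; the function name and the use of a plain list for the RNN outputs are assumptions made for illustration.

```python
def accumulate(rnn_outputs, initial_accumulation=0.0):
    """Apply (4.9): each estimate is the current RNN output plus the previous estimate."""
    estimates = []
    x_hat = initial_accumulation          # plays the role of the initial accumulation
    for beta in rnn_outputs:
        x_hat = beta + x_hat              # accumulate the RNN output
        estimates.append(x_hat)
    return estimates
```

For example, RNN outputs 1.0, 0.5 and -0.25 with an initial accumulation of 2.0 give the estimates 3.0, 3.5 and 3.25, so the RNN only has to produce the small increments while the estimates themselves may drift far from zero.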

Example 1. We tested the aforementioned accumulator using the following signal and measurement model:

$$x(t+1) = x(t) + 1.2\sin x(t) + 1.21 + 0.2\,\xi(t), \qquad (4.10)$$

$$y(t) = \sin x(t) + 0.1\,\varepsilon(t), \qquad (4.11)$$

where $\xi(t)$ and $\varepsilon(t)$ are independent standard white Gaussian sequences, and $x(0) = 0$. Note that even though the measurement $y(t)$ is confined essentially to a compact domain, the system remains observable because $y(t)$ is sensitive to changes in $x(t)$ no matter how large $x(t)$ is. So we expect the error of a good estimator to remain bounded as $x(t)$ increases.
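The growth of the signal in this example can be checked with a short simulation of (4.10) and (4.11). The sketch below is illustrative only: the function name, the `noise_scale` switch and the seed are assumptions, and setting `noise_scale=0` gives the noise-free trajectory.

```python
import math
import random


def simulate_example(steps, noise_scale=1.0, seed=0):
    """Simulate (4.10)-(4.11) from x(0) = 0; returns (signal, measurements)."""
    rng = random.Random(seed)
    x, signal, measurements = 0.0, [], []
    for _ in range(steps):
        # Measurement of the current signal value, as in (4.11).
        measurements.append(math.sin(x) + 0.1 * noise_scale * rng.gauss(0.0, 1.0))
        signal.append(x)
        # Signal update, as in (4.10); since 1.21 > 1.2 the drift is always positive.
        x = x + 1.2 * math.sin(x) + 1.21 + 0.2 * noise_scale * rng.gauss(0.0, 1.0)
    return signal, measurements


signal, measurements = simulate_example(150, noise_scale=0.0)
```

Without noise the increment $1.2\sin x(t) + 1.21$ is at least 0.01, so the signal grows at every step while the measurement $\sin x(t)$ stays within $[-1, 1]$.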

The filtering result of a neural filter with an accumulator (NFWA) as a range extender was compared with the extended Kalman filter (EKF), the iterated extended Kalman filter (IEKF), and a neural filter (NF) using sigmoidal scaling to extend the RNN output range. The EKF and IEKF fail to track the signal. The RMSEs averaged over 500 runs of the NFWA ( — ) and the NF (---) are plotted versus time in Fig. 4.3. The RNN in both cases is an MLPWIN with a single hidden layer of 9 fully interconnected neurons. The length of each training data sequence is 100. Both methods were tested for 150 time points to assess their ability to generalize. With the same MLPWIN architecture, the NFWA performs better than the NF before 100 time points. After 100 time points, the latter deteriorates rapidly while the former remains good.

Figure 4.2: Neural filter with an accumulator

Figure 4.3: RMSEs versus time of NFWA ( — ) and NF (— ) for Example 1.

4.3. A Differencer as a Range Reducer

If the measurement process $y(t)$ consists of the vector values, at discrete time points, of a continuous continuous-time process, then the vector value $y(t-1)$ is a reasonably good approximation of the vector value $y(t)$. This observation motivated a simple yet effective way to reduce the range of the measurements when two consecutive measurements, $y(t-1)$ and $y(t)$, are not too far apart.




F30602-91-C-0033 


Maryland  Technology  Corporation 


Consider the neural filter depicted in Fig. 4.4. A differencer is concatenated at the input terminals of an RNN. At each time point $t$, the differencer subtracts the preceding measurement vector $y(t-1)$ from the current measurement vector $y(t)$ and feeds the difference $y(t) - y(t-1)$ to the input terminals of the RNN.

Notice that if $y(0)$ is available as a constant vector, the sequence $y(\tau) - y(\tau-1)$, $\tau = 1, 2, \ldots, t$, can easily be used to construct the sequence $y(\tau)$, $\tau = 1, 2, \ldots, t$, and vice versa. Therefore, they contain the same amount of information about the signal $x(t)$, for $t = 1, 2, \ldots$. Since the constant vector is expected to be "memorized" in the RNN during its training, the differencer is not expected to destroy any information that the RNN cannot salvage. This expectation is borne out by the following example, in which a differencer is concatenated at the input terminals of an MLPWIN and an EKF is used to reduce the output range of the MLPWIN. The neural filter consisting of the MLPWIN, the differencer and the EKF is depicted in Fig. 4.5.

We note here that a differencer concatenated at the input terminals of an RNN does not require additional treatment in a training algorithm for the rest of the neural filter except replacing $y(t)$ by $y(t) - y(t-1)$ as the input vector at time $t$.
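The claim that the differenced sequence and the original sequence carry the same information when $y(0)$ is known can be illustrated with a short Python sketch for scalar measurements. The function names are hypothetical.

```python
def difference(measurements, y0):
    """Return y(t) - y(t-1) for t = 1, 2, ..., where y(0) = y0."""
    differences, previous = [], y0
    for y in measurements:
        differences.append(y - previous)
        previous = y
    return differences


def reconstruct(differences, y0):
    """Rebuild y(1), y(2), ... from the differences and the constant y(0)."""
    measurements, current = [], y0
    for d in differences:
        current = current + d
        measurements.append(current)
    return measurements
```

Here `reconstruct(difference(ys, y0), y0)` returns the original measurements `ys` (up to floating-point rounding in general), while the values actually fed to the RNN are only the step-to-step changes.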

Example 2. We tested the neural filter depicted in Fig. 4.5 using the signal and measurement processes described by the model:

$$x(t+1) = x(t) + 1.2\sin x(t) + 1.21 + 0.2\,\xi(t), \qquad (4.12)$$

$$y(t) = x(t) + \varepsilon(t), \qquad (4.13)$$

where $\xi(t)$ and $\varepsilon(t)$ are independent standard white Gaussian sequences, and $x(0) = 0$. Note that the EKF should be a good estimator for this problem because the signal $x(t)$ enters the measurement $y(t)$ in a linear manner.

The  RMSEs  averaged  over  500  runs  are  plotted  versus  time  for  the  neural  filter 
( — )  and  the  EKF  (— )  in  Fig.  4.6.  The  RNN  in  the  neural  filter  is  an  MLPWIN 
with  a  single  hidden  layer  of  9  fully  interconnected  neurons.  The  length  of  each 
training  data  sequence  is  100.  The  neural  filter  is  tested  for  150  time  points 
to  assess  its  generalization  capability.  The  performance  of  the  neural  filter  is 
significantly  better  than  that  of  the  EKF. 

4.4.  Training  of  an  RNN  with  a  Range  Extender 

The  concatenation  of  a  range  extender  to  an  outward  output  node  of  an  RNN 
necessitates  special  treatment  in  a  training  algorithm  for  the  RNN. 

We consider an MLPWIN with $L+1$ layers of nodes/neurons. The input layer is the 0th layer and the output layer is the $L$th layer. The number of neurons in the $l$th layer is $n_l$. The activation and weighted sum of the $i$th neuron in the $l$th layer at time $t$ are denoted $\beta_i^l(t)$ and $\eta_i^l(t)$, respectively. The activation function of the $i$th neuron in the $l$th layer is denoted $a_i^l(\cdot)$. The weight for feeding $\beta_j^{l-1}(t)$ to the $i$th neuron in the $l$th layer at time $t$ is denoted $w_{ij}^l$. The weight for feeding $\beta_j^l(t-1)$ to the $i$th neuron in the $l$th layer at time $t$ is denoted $w_{ij}^{rl}$. Without loss of generality, we assume this MLPWIN has a single input node and a single output node.

Figure 4.4: Neural filter with a differencer

Figure 4.5: Neural filter with an EKF and a differencer

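The training data consists of pairs of measurement $y(t,\omega)$ and signal $x(t,\omega)$ for $t = 1, 2, \ldots, T$, where $\omega \in S$ and $S$ is a collection of typical realizations of the measurement and signal processes. For the realization $\omega$, the activation and weighted sum of the $i$th neuron in the $l$th layer are denoted $\beta_i^l(t,\omega)$ and $\eta_i^l(t,\omega)$, respectively. Obviously, $\beta_1^0(t,\omega) = y(t,\omega)$.

The cost function to be minimized in training is

$$C = \frac{1}{T(\#S)} \sum_{\omega \in S} \sum_{t=1}^{T} \bigl(x(t,\omega) - \hat{x}(t,\omega)\bigr)^2, \qquad (4.14)$$

where $\hat{x}(t,\omega)$ is the estimate of $x(t,\omega)$ by the neural filter. $C$ may be written as $C = (1/(\#S)) \sum_{\omega \in S} C_\omega$, where $C_\omega = (1/T) \sum_{t=1}^{T} (x(t,\omega) - \hat{x}(t,\omega))^2$.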

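To minimize $C$, $\partial C/\partial w_{ij}^l$ and $\partial C/\partial w_{ij}^{rl}$ are usually needed. We use the BPTT algorithm, which gives

$$\frac{\partial C}{\partial w_{ij}^l} = \frac{1}{\#S} \sum_{\omega \in S} \sum_{t=1}^{T} \frac{\partial C_\omega}{\partial \eta_i^l(t,\omega)}\,\beta_j^{l-1}(t,\omega), \qquad (4.15)$$

and

$$\frac{\partial C}{\partial w_{ij}^{rl}} = \frac{1}{\#S} \sum_{\omega \in S} \sum_{t=1}^{T} \frac{\partial C_\omega}{\partial \eta_i^l(t,\omega)}\,\beta_j^{l}(t-1,\omega). \qquad (4.16)$$

$\partial C_\omega/\partial \eta_i^l(t,\omega)$ is obtained by the recursive relation

$$\frac{\partial C_\omega}{\partial \eta_i^l(t,\omega)} = (a_i^l)'(\eta_i^l(t,\omega)) \Bigl[ \sum_{j=1}^{n_{l+1}} w_{ji}^{l+1} \frac{\partial C_\omega}{\partial \eta_j^{l+1}(t,\omega)} + \sum_{j=1}^{n_l} w_{ji}^{rl} \frac{\partial C_\omega}{\partial \eta_j^{l}(t+1,\omega)} \Bigr], \qquad (4.17)$$

for $l = L-1, L-2, \ldots, 1$, $t = T, T-1, \ldots, 1$ and $\omega \in S$. The final condition is $\partial C_\omega/\partial \eta_i^l(T+1,\omega) = 0$.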
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4.4.1.  Training  of  an  RNN  with  an  accumulator 

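With an accumulator as a range extender,

$$\hat{x}(t,\omega) = \beta^L(t,\omega) + \hat{x}(t-1,\omega), \qquad (4.18)$$

for $t = 1, 2, \ldots, T$ and $\omega \in S$. $\hat{x}(0,\omega)$ is the a priori estimate of $x(0,\omega)$. The recursive formulas for obtaining $\partial C_\omega/\partial \eta^L(t,\omega)$ are

$$\frac{\partial C_\omega}{\partial \eta^L(t,\omega)} = \frac{\partial C_\omega}{\partial \hat{x}(t,\omega)}\,(a^L)'(\eta^L(t,\omega)), \qquad (4.19)$$

$$\frac{\partial C_\omega}{\partial \hat{x}(t,\omega)} = -\frac{2}{T}\bigl(x(t,\omega) - \hat{x}(t,\omega)\bigr) + \frac{\partial C_\omega}{\partial \hat{x}(t+1,\omega)}, \qquad (4.20)$$

for $t = T, T-1, \ldots, 1$ and $\omega \in S$. The final condition is $\partial C_\omega/\partial \hat{x}(T+1,\omega) = 0$.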
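The backward recursion for the accumulator can be sketched and checked numerically as follows. The sketch assumes a scalar signal, a single realization, and the mean square cost $C_\omega = (1/T)\sum_t (x(t) - \hat{x}(t))^2$; the function names are hypothetical. Because every later estimate depends on an earlier RNN output through the accumulator, the derivative at time $t$ collects the error terms of all times from $t$ to $T$.

```python
def accumulator_forward(betas, x0_hat):
    """Estimates produced by the accumulator from RNN outputs betas."""
    estimates, x_hat = [], x0_hat
    for beta in betas:
        x_hat = beta + x_hat
        estimates.append(x_hat)
    return estimates


def cost(targets, betas, x0_hat):
    """Mean square error between the targets and the accumulated estimates."""
    estimates = accumulator_forward(betas, x0_hat)
    return sum((x - xh) ** 2 for x, xh in zip(targets, estimates)) / len(targets)


def output_gradients(targets, betas, x0_hat):
    """Derivative of the cost with respect to each estimate, by backward recursion."""
    estimates = accumulator_forward(betas, x0_hat)
    T = len(targets)
    grads = [0.0] * T
    carry = 0.0  # final condition: the derivative at time T + 1 is zero
    for t in range(T - 1, -1, -1):
        # Explicit error term at time t plus the term propagated back from t + 1.
        carry = -2.0 / T * (targets[t] - estimates[t]) + carry
        grads[t] = carry
    return grads
```

Multiplying each entry by the derivative of the output activation function would then give the derivative with respect to the weighted sum of the output neuron. With a linear output neuron, the entries coincide with the derivatives of the cost with respect to the RNN outputs, which is what the finite-difference comparison in a test can verify.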


4.4.2.  Training  of  an  RNN  with  an  EKF 

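With an EKF and an adder as a range extender,

$$\hat{x}(t,\omega) = \beta^L(t,\omega) + \tilde{x}(t,\omega), \qquad (4.21)$$

for $t = 1, 2, \ldots, T$ and $\omega \in S$, where $\tilde{x}(t,\omega)$ is the estimate generated by an EKF, which is a function of $y(t,\omega)$, $\hat{x}(t-1,\omega)$ and $\Sigma(t-1,\omega)$. Also note that $\Sigma(t,\omega)$ is a function of $\hat{x}(t-1,\omega)$ and $\Sigma(t-1,\omega)$. The recursive formulas for obtaining $\partial C_\omega/\partial \eta^L(t,\omega)$ are

$$\frac{\partial C_\omega}{\partial \eta^L(t,\omega)} = \frac{\partial C_\omega}{\partial \hat{x}(t,\omega)}\,(a^L)'(\eta^L(t,\omega)), \qquad (4.22)$$

$$\frac{\partial C_\omega}{\partial \hat{x}(t,\omega)} = -\frac{2}{T}\bigl(x(t,\omega) - \hat{x}(t,\omega)\bigr) + \frac{\partial C_\omega}{\partial \hat{x}(t+1,\omega)}\,\frac{\partial \tilde{x}(t+1,\omega)}{\partial \hat{x}(t,\omega)} + \frac{\partial C_\omega}{\partial \Sigma(t+1,\omega)}\,\frac{\partial \Sigma(t+1,\omega)}{\partial \hat{x}(t,\omega)}, \qquad (4.23)$$

$$\frac{\partial C_\omega}{\partial \Sigma(t,\omega)} = \frac{\partial C_\omega}{\partial \hat{x}(t+1,\omega)}\,\frac{\partial \tilde{x}(t+1,\omega)}{\partial \Sigma(t,\omega)} + \frac{\partial C_\omega}{\partial \Sigma(t+1,\omega)}\,\frac{\partial \Sigma(t+1,\omega)}{\partial \Sigma(t,\omega)}, \qquad (4.24)$$

for $t = T, T-1, \ldots, 1$ and $\omega \in S$. The final conditions are $\partial C_\omega/\partial \hat{x}(T+1,\omega) = \partial C_\omega/\partial \Sigma(T+1,\omega) = 0$.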
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5.  Adaptive  Neural  Filtering 

In the conventional theory of adaptive filtering, the system model contains a parameter process $\theta(t)$, which represents the changes in the filtering environment. The parameter process $\theta(t)$ may either be an unknown deterministic function of time or a stochastic process with known defining joint probabilities. Given the value of $\theta(t)$ at each $t$, the system model is a well-defined stochastic system (i.e. all the defining joint probabilities for the signal and measurements are determined).

In the context of neural filtering, a system model may or may not be available and the underlying system is considered instead. However, the changes in the filtering environment can still be represented by a parameter process $\theta(t)$. As before, $\theta(t)$ may be an unknown deterministic function of time or a stochastic process with known defining joint probabilities. But $\theta(t)$ may also be a stochastic process with unknown defining joint probabilities.

The difference between the definitions of an unknown deterministic function and a stochastic process with unknown defining joint probabilities is vague, because hypothetically, any process can be observed over a long enough time for a sufficient number of times in typical working environments so that the observed realizations define the process as a stochastic process (with all its defining joint probabilities). Hence, an unknown process should be viewed as an unknown deterministic function, unless the variability in its hypothetical probability distribution is small enough for this distribution to be useful.

Whether $\theta(t)$ is treated as an unknown deterministic process or a stochastic process with known or unknown statistics, the first two questions to ask before considering adaptive neural filtering are the following:

(a). Can we simply use a recurrent neural network (RNN) that inputs the measurement process and outputs the estimate of the signal process, and let the RNN learn in its training how to deal with $\theta(t)$ internally?

(b). Can we train and use an RNN to estimate $\theta(t)$?

We notice that in training either of the two RNNs mentioned in these questions, some training data set, containing different actual or simulated, hidden or explicit realizations of $\theta(t)$, has to be used. The trained RNN is supposed to be optimal for the distribution of $\theta(t)$ represented by these realizations. If $\theta(t)$ is a stochastic process with reasonably small variability and the training data set sufficiently represents this stochastic process, the answers to both of these questions are positive and no adaptive filtering is necessary. However, if $\theta(t)$ is either an unknown deterministic process or a stochastic process with large variability, which is not fully represented by the training data set, the trained RNN may be far from optimal in a particular use. Moreover, the size of the RNN may have to be economically unacceptably large to accommodate the large variability of the distribution of a stochastic $\theta(t)$. Therefore, if $\theta(t)$ is an unknown deterministic process or a stochastic process with large variability, the answers to the above two questions are negative. In this case, the filtering for the signal $x(t)$ has to be adapted to the changing $\theta(t)$ on the fly, meaning that the estimation of $\theta(t)$ at a current time $t$ has to be made completely or essentially on the basis of the recent measurements $y(s)$, $s < t$. Basically, no a priori knowledge about $\theta(t)$ in either the form of statistics or the form of signal and measurement realizations can be used. This situation actually defines the word "adaptation" in the context of neural filtering.

For the simplicity of our discussion, we will restrict it to parameter functions that are unknown deterministic in the sequel. Only constant parameter functions will be considered. Two methods of adaptive filtering will be presented. These methods evaluate the maximum likelihood estimate of $\theta(t)$ and the minimum variance estimate of $x(t)$, given the measurements $y^t$.

In the rest of this chapter, we assume that the signal $x(t)$ and measurement $y(t)$ satisfy the discrete-time equations

$$x(t+1,\theta^{t+1}) = f(x(t,\theta^t),\theta(t+1)) + G(x(t,\theta^t),\theta(t+1))\,w(t), \qquad (5.1)$$

$$y(t,\theta^t) = h(x(t,\theta^t),\theta(t)) + v(t), \qquad (5.2)$$

where $f$, $G$, and $h$ may or may not be known; and $w(t)$ and $v(t)$ are independent white Gaussian sequences with mean zero and variances $E[w(t)w'(t)] = Q(t)$ and $E[v(t)v'(t)] = R(t)$.

5.1.  Adapting  to  Unknown  Constant  Parameters 

For a given value of $\theta^t$, estimating $x(t)$ given $y^t$ is a standard filtering problem. However, $\theta^t$ is not given, and $\theta(t)$ has to be estimated on the fly. Nevertheless, a neural filter can be synthesized in advance (i.e. before the filter is put to use) that processes the estimate $\hat{\theta}(t)$ of $\theta(t)$ as well as $y(t)$ as measurements, and produces a minimum variance estimate $\hat{x}(t)$ of $x(t)$ for these measurements. The question is how best to estimate $\theta(t)$ given $y^t$ on the fly. Under the assumption that $\theta(t)$ is a constant $\theta$, we will illustrate each of two adaptive filtering methods by first describing how each component is synthesized, then showing how the components are interconnected, and finally how $\theta$ is estimated on the fly by minimizing an error criterion. The interpretations of these two methods will then be given.
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Method  1 

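(a). Train $NN_h$ as follows:

[Diagram (5.3): $NN_h$ receives $\theta$ and $y(t,\theta)$ as inputs; its output $\hat{h}(y^{t-1},\theta)$ is trained against the target $y(t,\theta)$.]

where $\theta$ ranges over all possible values of $\theta$.

(b). Pick a possible value of $\theta$ and simulate (5.1), (5.2) and $NN_h$ to get $y(t,\theta)$ and $\hat{h}(y^{t-1},\theta)$ for $t = S, S+1, \ldots$, where $S$ is a time at which $y(t,\theta)$ is in a steady state. Get an estimate $\hat{\theta}$ of $\theta$ by minimizing

$$C(\hat{\theta}) = \frac{1}{T} \sum_{t=S}^{S+T-1} [y(t,\theta) - \hat{h}(y^{t-1},\hat{\theta})]'\,R^{-1}\,[y(t,\theta) - \hat{h}(y^{t-1},\hat{\theta})],$$

where $T$ is a length of time that is sufficient for a good estimation of $\theta$. Repeat the above to obtain a sufficient number of $\{(y(t,\theta), x(t,\theta), \hat{\theta}),\ t = S, S+1, \ldots\}$ as training data for the next step.

(c). Train $NN_x$ as follows:

[Diagram (5.4): $NN_x$ receives $\hat{\theta}$ and $y(t,\theta)$ as inputs; its output is trained against the target $x(t,\theta)$.]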


(d).  After  training,  (5.3)  and  (5.4)  are  run  in  parallel  with  9  determined  at  t  by 
minimizing 
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Method  2 

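(a). Train $NN_h$ as follows:

where $\theta$ ranges over all possible values of $\theta$.

(b). Train $NN_f$ as follows:

[Diagram (5.6): the output $\hat{f}(y^t,\theta)$ of $NN_f$ is trained against the target $f(x(t,\theta),\theta)$.]

where $f(x(t,\theta),\theta)$ can be replaced by $x(t+1,\theta)$, if $f$ is unknown.

(c). Train $NN_x$ as follows: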
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(d).  The  above  3  RNNs  being  connected  as  follows: 


the  estimation  error  covariance, 

V{9)  :=  EW)m\, 

£(«)  ;=  -  l’(y‘-\e)]', 

is  approximated  for  each  possible  value  of  9  by  the  Monte  Carlo  method 
using  NNf^  and  NNj  jointly.  An  MLP  may  serve  as  a  good  approximating 
function  here. 

(e) .  An  estimate  0  of  0  is  obtained  by  minimizing 

E  (5.9) 

T=t-r+i 

where 

£((,«)  :=  (xV,*) 

(f). To retrain $NN_x$, the retraining data is collected as follows: for a possible value of $\theta$, simulate (5.1) and (5.2), run (5.8) and minimize (5.9) to get $\hat{\theta}$; then $((\hat{\theta}, y(t,\theta)), x(t,\theta))$ is used as a training input/output pair in retraining $NN_x$ as follows:

(g). Repeat (e) and (f) using the latest version of $NN_x$, until $NN_x$ converges.

Interpretation of Method 1

Recall the known property of the innovation process, $I(t,\theta) := y(t,\theta) - E[y(t,\theta)|y^{t-1}]$, that for each $\theta$, $I(t,\theta)$ is a white Gaussian process with mean 0 and covariance $R$. Minimizing $C(\theta)$ is equivalent to maximizing the conditional likelihood function for $\theta$ given the measurements $y^{t-1}$. Therefore, Method 1 produces a maximum likelihood estimate $\hat{\theta}$ of $\theta$ and the corresponding minimum variance estimate $\hat{x}(y^t,\hat{\theta})$ for $y^t$ and $\hat{\theta}$.
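The idea of estimating a constant parameter by minimizing a sum of squared innovations can be illustrated on a toy problem. The sketch below is not the neural implementation of Method 1: it assumes a scalar linear model $x(t+1) = \theta x(t) + w(t)$, $y(t) = x(t) + v(t)$, uses a Kalman predictor in place of the trained network that predicts $y(t)$ from past measurements, and searches a coarse grid of candidate values. The model, noise variances, seed, and all names are assumptions chosen for illustration.

```python
import random


def simulate(theta, steps, rng, q=1.0, r=0.01):
    """Simulate x(t+1) = theta*x(t) + w(t), y(t) = x(t) + v(t) from x(0) = 0."""
    x, ys = 0.0, []
    for _ in range(steps):
        x = theta * x + rng.gauss(0.0, q ** 0.5)
        ys.append(x + rng.gauss(0.0, r ** 0.5))
    return ys


def innovation_cost(theta, ys, q=1.0, r=0.01):
    """Mean squared one-step prediction error of a Kalman predictor
    that assumes the parameter value theta."""
    x_hat, p, total = 0.0, 1.0, 0.0
    for y in ys:
        x_pred = theta * x_hat               # predicted signal
        p_pred = theta * theta * p + q       # predicted error variance
        innovation = y - x_pred              # measurement minus its prediction
        total += innovation * innovation
        gain = p_pred / (p_pred + r)
        x_hat = x_pred + gain * innovation   # measurement update
        p = (1.0 - gain) * p_pred
    return total / len(ys)


rng = random.Random(0)
measurements = simulate(0.7, 2000, rng)
candidates = [i / 10 for i in range(1, 10)]
theta_hat = min(candidates, key=lambda th: innovation_cost(th, measurements))
```

In this toy setting the candidate with the smallest innovation cost is expected to lie at or next to the value 0.7 used in the simulation. Weighting the squared innovations by a constant $R^{-1}$ would not change which candidate is selected.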

Interpretation  of  Method  2 

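The key question here is "What is the conditional probability distribution of $E[x(t,\theta)|y^t] - E[x(t,\theta)|y^{t-1}]$ given $y^t$?" If the system, (5.1) and (5.2), is linear, i.e. $f(x(t,\theta),\theta) = F(t,\theta)x(t,\theta)$, $G(x(t,\theta),\theta) = G(t,\theta)$, and $h(x(t,\theta),\theta) = H(t,\theta)x(t,\theta)$, the measurement update equation is

$$E[x(t,\theta)|y^t] = E[x(t,\theta)|y^{t-1}] + P_t H_t'(H_t P_t H_t' + R)^{-1}\,[y(t,\theta) - H_t E[x(t,\theta)|y^{t-1}]],$$

where $P_t$ is the covariance of the prediction error $x(t,\theta) - E[x(t,\theta)|y^{t-1}]$ and $H_t = H(t,\theta)$. Hence

$$E[x(t,\theta)|y^t] - E[x(t,\theta)|y^{t-1}] = K(t)I(t,\theta), \qquad K(t) = P_t H_t'(H_t P_t H_t' + R)^{-1},$$

is a white Gaussian process with mean 0 and covariance $K(t)R(t)K'(t)$.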

Although in general $E[x(t,\theta)|y^t] - E[x(t,\theta)|y^{t-1}]$ is not Gaussian, the measurement update equation of the extended Kalman filter for the system, (5.1) and (5.2), does provide a reasonable approximation of the quantity. Hence it is approximately white Gaussian with mean 0 and covariance $K(t)R(t)K'(t)$, where

$$H_t = \left.\frac{\partial h(x,\theta)}{\partial x}\right|_{x = E[x(t,\theta)|y^{t-1}]}.$$

In view of this approximation, Method 2 provides approximately a maximum likelihood estimate of $\theta$ and a minimum variance estimate of $x$.


5.2.  Adapting  to  Unknown  Dynamic  Parameters 

In addition to (5.1) and (5.2), we assume, in this section, that the parameter function $\theta(t)$ is a dynamic process satisfying the dynamic equation

$$\theta(t+1) = p(\theta(t), \ldots, \theta(t-m+1), u(t)), \qquad (5.11)$$

where $m$ is an unknown integer, $p$ is an unknown function and $u(t)$ is an unknown stochastic process independent of $w$ and $v$. The variabilities of $p$, $m$, and $u$ from application to application are assumed to be so large that the information carried by the statistics of the stochastic process $\theta(t)$ is not sufficient for synthesizing a neural filter in advance to estimate $\theta(t)$ or to estimate $x(t)$ without adapting to $\theta(t)$. We note in passing that when the dynamic equation (5.11) takes on the special form $\theta(t+1) = \theta(t)$, the adaptive filtering problem considered here is the same as that considered in the preceding section.

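Method 3

(a). Train $NN_{\hat h}$ as follows:

[Diagram (5.12): $NN_{\hat h}$ receives $\theta(t)$ and $y(t,\theta^{t})$ as inputs; its output $\hat h$ is compared with the target $y(t,\theta)$.]

where $\theta$ ranges over all of its possible values.

(b). Train $NN_{\hat x}$ as follows:

[Diagram (5.13)]

(c). After training, (5.12) and (5.13) are run in parallel as follows with another RNN, $NN_{\hat\theta}$, whose weights, $w$, are adjusted by minimizing

$$C(w) = \frac{1}{T}\sum_{\tau=t-T+1}^{t} \big[y(\tau,\theta) - \hat h(y^{\tau-1},\hat\theta^{\tau}(w))\big]' R^{-1} \big[y(\tau,\theta) - \hat h(y^{\tau-1},\hat\theta^{\tau}(w))\big],$$

[Diagram (5.14)]

Method 4

(a). Train $NN_{\hat h}$ as follows:

[Diagram (5.15): $NN_{\hat h}$ receives $\theta(t)$ and $y(t,\theta^{t})$ as inputs; its output $\hat h$ is compared with the target $y(t,\theta)$.]

where a large variety of the parameter functions $\theta(t)$ is used. The distribution of them should be such that given any parameter function, the trained $NN_{\hat h}$ produces the optimal estimate of $y(t,\theta^{t})$ at any $t$.

(b). Train $NN_{\hat f}$ as follows:

[Diagram (5.16): $NN_{\hat f}$ receives $\theta(t)$ and $y(t,\theta^{t})$ as inputs; its output $\hat f(y^{t},\theta^{t})$ is compared with the target $f(x(t,\theta^{t}),\theta^{t})$.]

where $f(x(t,\theta),\theta)$ can be replaced by $x(t+1,\theta)$, if $f$ is unknown.

(c). Train $NN_{\hat x}$ as follows:

[Diagram (5.17): the output of $NN_{\hat x}$ is compared with the target $x(t,\theta)$.]

(d). The above 3 RNNs are run as follows with another RNN, $NN_{\hat\theta}$, whose weights, $w$, are adjusted by minimizing

$$D(w) := \frac{1}{T}\sum_{\tau=t-T+1}^{t} \big[\varepsilon'(\tau,w)\,V^{-1}(\tau)\,\varepsilon(\tau,w)\big],$$

where

$$\varepsilon(t,w) := \big[\hat x'(y^{t},\hat\theta^{t}(w)) - \hat f'(y^{t-1},\hat\theta(w)),\ \ y'(t,\theta) - \hat h'(y^{t-1},\hat\theta^{t}(w))\big]'$$

and

$$V(t) = \begin{bmatrix} K(t)R(t)K'(t) & 0 \\ 0 & R(t) \end{bmatrix}.$$
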


5.3.  Numerical  Results 

The signal and measurement equations used in our numerical study are

$$x(t+1) = \max\{\theta x(t)(1 - x(t)) + 0.1\,w(t),\ 0.00990\},$$
$$y(t) = 4(x(t) - 0.2)^{3} + 0.02\,v(t),$$

where $w(t)$ and $v(t)$ are independent standard white Gaussian sequences and $\theta$ is an unknown constant in the interval $[2, 3.6]$.
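A minimal simulation sketch of the signal and measurement equations above may help in reproducing training realizations. The noise coefficients 0.1 and 0.02 and the floor 0.0099 are read from the equations; the initial state `x0 = 0.5`, the seed and the function name `simulate` are assumptions made only for this illustration.

```python
import random


def simulate(theta, steps, x0=0.5, seed=0):
    """Generate one realization (x(t), y(t)) of the 5.3 test system (sketch)."""
    rng = random.Random(seed)
    xs, ys = [], []
    x = x0  # assumed initial state; the section does not state one
    for _ in range(steps):
        # measurement: y(t) = 4 (x(t) - 0.2)^3 + 0.02 v(t)
        ys.append(4.0 * (x - 0.2) ** 3 + 0.02 * rng.gauss(0.0, 1.0))
        xs.append(x)
        # signal: noisy logistic map, floored at 0.0099
        x = max(theta * x * (1.0 - x) + 0.1 * rng.gauss(0.0, 1.0), 0.0099)
    return xs, ys


xs, ys = simulate(theta=3.0, steps=50)
```

Drawing `theta` from the interval [2, 3.6] for each realization would give the variety of parameter values that the training procedure calls for.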

Only Method 1 was used in our numerical study. All the RNNs to be discussed have the architecture 2:7r:5:1. Let us recall the steps to synthesize an adaptive neural filter by Method 1. First, $NN_{\hat h}$ is synthesized. Secondly, $NN_{\hat h}$ is used to generate the training data set $\{(y(t,\theta), x(t,\theta), \hat\theta),\ t = s, s+1, \ldots\}$ for a given sliding window length $T$, as described in Step (b) of Method 1. Finally, this training data set is used to synthesize $NN_{\hat x}$.

The first adaptive neural filter that was synthesized for the foregoing system had a sliding window of length $T_1 = 1{,}000$. The histogram of $\hat\theta - \theta$ from 1,000 trials of the adaptive neural filter is shown in Figure 5.1. The second adaptive neural filter synthesized in our numerical study had a sliding window of length $T_2 = 100$.

The histogram of $\hat\theta - \theta$ from 10,000 trials is shown in Figure 5.2. Notice that both histograms are more or less Gaussian and unbiased, and as expected, the standard deviation in Figure 5.1 is about $\sqrt{T_2/T_1} = 1/\sqrt{10}$ of that in Figure 5.2.

To provide a benchmark or lower bound for the mean square error of the adaptive neural filter, an RNN, $NN_{\hat x}$, is trained as follows:

[Diagram: $NN_{\hat x}$ receives $\theta$ and $y(t,\theta)$ as inputs; its output $\hat x(y^{t},\theta)$ is compared with the target $x(t,\theta)$.]

Notice that the true value $\theta$ is used as an input to $NN_{\hat x}$. Thus, if $NN_{\hat x}$ is properly synthesized, $\hat x(y^{t},\theta)$ is a better estimate of $x(t,\theta)$ than $\hat x(y^{t},\hat\theta)$, which is produced by the adaptive neural filter, and the mean square error of $\hat x(y^{t},\theta)$ is a lower bound of that of $\hat x(y^{t},\hat\theta)$.

The third adaptive neural filter synthesized in our numerical study had a sliding window of length 20. The mean square error for 1,000 trials of the adaptive neural filter is shown in Figure 5.3 over a period of 20 time points. The difference between the mean square errors of $\hat x(y^{t},\hat\theta)$ and $\hat x(y^{t},\theta)$ for 1,000 trials is shown in Figure 5.4. Notice that this difference is very small.

When the sliding window length was reduced to 10, this difference remained rather small. The mean square error for 1,000 trials of the adaptive neural filter and the difference between the mean square errors of $\hat x(y^{t},\hat\theta)$ and $\hat x(y^{t},\theta)$ are shown in Figures 5.5 and 5.6, respectively.

Figure 5.4: The difference between the mean square errors of $\hat x(y^{t},\hat\theta)$ and $\hat x(y^{t},\theta)$ of the third adaptive neural filter

Figure 5.6: The difference between the mean square errors of $\hat x(y^{t},\hat\theta)$ and $\hat x(y^{t},\theta)$ when the sliding window length was reduced to 10

6.  Differentiating and Pruning Multilayer Feedforward Neural Networks

Two  questions  that  arise  after  (or  before)  a  multilayer  feedforward  network  is 
trained  up  on  a  large  amount  of  data  are  the  following: 

(a). How is a certain output (or response) variable related to a certain input (or independent) variable?

(b). Are some input variables superfluous? Which ones are they? Can we eliminate them?

To  answer  these  questions,  we  need  to  find  the  structures  of  the  network  which 
serve  to  reveal  heuristically  meaningful  functional  characteristics  hidden  in  lay¬ 
ers  of  neurons  and  to  appreciate  the  separate  and  joint  effects  produced  on  the 
response  variables  by  changes  in  the  independent  variables. 

Since the network training is actually a nonlinear regression [4], let us examine how the highly successful linear regression model is used to interpret data. In the linear model, the parameters that summarize data are the rates of change of the response variables with respect to (w.r.t.) the independent variables. These first order derivatives of the response variables w.r.t. the input variables are precisely the heuristically meaningful structures that are used to explain data and select or deselect independent variables in the regression analysis. Similar structures do exist in a multilayer feedforward network.

A  multilayer  feedforward  network  is  a  multistage  mixture  of  summations  and 
compositions  of  logistic  functions  or  hyperbolic  tangents,  which  are  analytic,  i.e. 
infinitely  differentiable  and  expandable  into  power  series  converging  to  the  ex¬ 
panded  functions.  It  is  therefore  also  analytic.  It  is  known  that  the  derivatives  of 
an  analytic  function  w.r.t.  its  independent  variables  determine  how  much  linear, 
quadratic,  cubic  and  higher  order  components  there  are  in  the  function. 

It has been proven [10] that a multilayer feedforward network with a single hidden layer
and a reasonably smooth activation function is capable of approximating an arbitrary
function and its derivatives to any degree of accuracy.  So it is justified to infer the
dependence  of  the  output  variables  on  the  input  variables  for  the  training  data  by 
differentiating  the  trained-up  approximation  network.  Obviously,  the  magnitudes 
of  these  derivatives,  especially  those  of  the  first  order,  do  provide  us  with  ample 
intuitive  understanding  of  the  function  and  guide  us  in  deselecting  independent 
variables. 

In this paper, we present two methods of evaluating the derivatives of the output variables of a multilayer feedforward network w.r.t. the independent variables, not only of the first order but also of higher orders. In analogy to the backpropagation training [22, 26], one of our methods propagates the derivatives of the response variables w.r.t. neuron outputs backward one layer at a time. These derivatives (w.r.t. neuron outputs) can be used to interpret the relationships between neurons of different layers and to prune superfluous neurons as well as inputs. In contrast, our second method propagates the derivatives of neuron outputs w.r.t. the input variables forward one layer at a time.

The  two  methods  are  equivalent  when  they  are  used  to  determine  derivatives 
of  the  output  variables  w.r.t.  the  input  variables.  It  is  shown  that  they  both 
result  from  a  fundamental  chain  rule  for  differentiating  a  multilayer  feedforward
neural  network,  which  is  derived  in  the  appendix.  This  chain  rule  is  also  used  to 
differentiate  the  aforementioned  derivatives  w.r.t.  the  network  weights. 

In Sections 6.4-6.6 of the paper, we discuss how these derivatives are applied to pruning the superfluous weights and input variables: first, the covariances of the derivatives and those of the weights are estimated; then, hypotheses are statistically tested to determine which inputs and/or weights are prunable. A numerical example is given in Section 6.7.


6.1.  Differentiation  by  Forward  Propagation 

Consider a multilayer feedforward network with $L$ layers. Let the weighted sum in and the output of the $i$-th neuron in the $l$-th layer be denoted by $y_i^l$ and $x_i^l$, respectively. Thus $x_i^0$ and $x_i^L$ are the $i$-th input and the $i$-th output. The outputs of the network are obtained by the forward propagation through each layer:

$$y_i^l = \sum_j w_{ij}^l\, x_j^{l-1}, \qquad x_i^l = f(y_i^l), \tag{6.1}$$

for $l = 1, \ldots, L$, where $f$ is the neuron activation function and $w_{ij}^l$ is the weight from the $j$-th neuron in the $(l-1)$-th layer to the $i$-th neuron in the $l$-th layer. By the chain rule of differentiation,

$$\frac{\partial x_i^l}{\partial x_j^0} = \frac{\partial x_i^l}{\partial y_i^l}\cdot\frac{\partial y_i^l}{\partial x_j^0} = f'(y_i^l)\sum_k w_{ik}^l\,\frac{\partial x_k^{l-1}}{\partial x_j^0}. \tag{6.2}$$

Thus, given $\partial x_i^1/\partial x_j^0 = f'(y_i^1)\cdot w_{ij}^1$, we can recursively calculate $\partial x_i^l/\partial x_j^0$ for $l = 2, \ldots, L$.
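The forward recursion for the first order derivatives can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: the activation is taken to be the hyperbolic tangent, there are no bias terms, and the names `forward_jacobian`, `finite_difference` and the weight values are hypothetical. The result is checked against central finite differences.

```python
import math


def forward_jacobian(weights, x0):
    """Propagate outputs x^l and derivatives d x^l / d x^0 forward, layer by layer."""
    n_in = len(x0)
    x = list(x0)
    # jac[k][j] = d x_k^{l-1} / d x_j^0; the identity at the input layer
    jac = [[1.0 if k == j else 0.0 for j in range(n_in)] for k in range(n_in)]
    for w in weights:  # w[i][k] plays the role of w_ik^l
        y = [sum(w_ik * x_k for w_ik, x_k in zip(row, x)) for row in w]
        # d y_i^l / d x_j^0 = sum_k w_ik^l * d x_k^{l-1} / d x_j^0
        dy = [[sum(row[k] * jac[k][j] for k in range(len(x))) for j in range(n_in)]
              for row in w]
        x = [math.tanh(v) for v in y]
        # f'(y) = 1 - tanh(y)^2 for the assumed tanh activation
        jac = [[(1.0 - x[i] ** 2) * dy[i][j] for j in range(n_in)]
               for i in range(len(w))]
    return x, jac


def finite_difference(weights, x0, j, eps=1e-6):
    """Central difference of the first network output w.r.t. input j."""
    hi = list(x0)
    lo = list(x0)
    hi[j] += eps
    lo[j] -= eps
    return (forward_jacobian(weights, hi)[0][0]
            - forward_jacobian(weights, lo)[0][0]) / (2.0 * eps)


weights = [
    [[0.5, -0.3], [0.8, 0.2], [-0.6, 0.9]],  # layer 1: 2 inputs -> 3 neurons
    [[0.4, -0.7, 0.1]],                      # layer 2: 3 neurons -> 1 output
]
x0 = [0.3, -0.2]
out, jac = forward_jacobian(weights, x0)
```

Only the derivatives of the previous layer are kept in memory at each step, which is what makes the forward scheme convenient when the number of inputs is small.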

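For the second order derivatives, we need to utilize the propagated first order derivatives as follows:

$$\frac{\partial^2 x_i^l}{\partial x_j^0\,\partial x_k^0} = \frac{\partial}{\partial x_k^0}\Big(f'(y_i^l)\,\frac{\partial y_i^l}{\partial x_j^0}\Big) = f'(y_i^l)\,\frac{\partial^2 y_i^l}{\partial x_j^0\,\partial x_k^0} + \frac{\partial f'(y_i^l)}{\partial x_k^0}\cdot\frac{\partial y_i^l}{\partial x_j^0}$$

$$= f'(y_i^l)\,\frac{\partial^2 y_i^l}{\partial x_j^0\,\partial x_k^0} + f''(y_i^l)\,\frac{\partial y_i^l}{\partial x_j^0}\cdot\frac{\partial y_i^l}{\partial x_k^0}.$$

Notice that due to the linear dependence of $y_i^l$ on $x_k^{l-1}$, we have

$$\Big(\prod_{n=1}^{N}\frac{\partial}{\partial x_{j_n}^0}\Big)\, y_i^l = \sum_k w_{ik}^l \Big(\prod_{n=1}^{N}\frac{\partial}{\partial x_{j_n}^0}\Big)\, x_k^{l-1}.$$
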

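To generalize the above result to the $N$-th order derivatives, we define an $n$-partition of a set $\{j_1, \ldots, j_N\}$.

Definition. An $n$-partition of $\{j_1, \ldots, j_N\}$, which we denote by $p_n^N$, is a set of $n$ non-empty subsets $p_{n,i}^N$ of $\{j_1, \ldots, j_N\}$ which satisfy:

(a). completeness: $\bigcup_{i=1}^{n} p_{n,i}^N = \{j_1, \ldots, j_N\}$;

(b). exclusiveness: $p_{n,i}^N \cap p_{n,j}^N = \emptyset$, for $j \neq i$.

We denote the set of all $n$-partitions of $\{j_1, \ldots, j_N\}$ by $P_n^N(j_1, \ldots, j_N)$.

Using the chain rule (6.21) with $g = x_i^l$ and $z_j = y_j^l$, $j = 1, \ldots, n_l$, we can write down the propagation formula for the $N$-th order derivatives

$$\Big(\prod_{n=1}^{N}\frac{\partial}{\partial x_{j_n}^0}\Big)\, x_i^l = \sum_{n=1}^{N} f^{(n)}(y_i^l) \sum_{p_n^N \in P_n^N(j_1,\ldots,j_N)}\ \prod_{k=1}^{n}\Big(\prod_{u \in p_{n,k}^N}\frac{\partial}{\partial x_u^0}\Big)\, y_i^l. \tag{6.4}$$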
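The $n$-partitions in the definition above can be enumerated recursively. The following is a small sketch; the generator name `n_partitions` is hypothetical, and the element order inside each block carries no meaning. The number of $n$-partitions of an $N$-element set is the Stirling number of the second kind, which gives a simple check.

```python
def n_partitions(items, n):
    """Yield every partition of `items` into exactly n non-empty, disjoint blocks."""
    items = list(items)
    if n == 0:
        if not items:
            yield []  # the empty set has exactly one 0-partition
        return
    if len(items) < n:
        return  # not enough elements to fill n non-empty blocks
    first, rest = items[0], items[1:]
    # Case 1: `first` forms a block on its own.
    for part in n_partitions(rest, n - 1):
        yield [[first]] + part
    # Case 2: `first` joins one of the n blocks of a partition of the rest.
    for part in n_partitions(rest, n):
        for k in range(n):
            yield part[:k] + [[first] + part[k]] + part[k + 1:]


two_blocks = list(n_partitions(["j1", "j2", "j3"], 2))
counts_for_four = [len(list(n_partitions(range(4), n))) for n in range(1, 5)]
```

The two cases of the recursion, a new singleton block or an existing block enlarged by the new element, are the same two cases that drive the inductive proof of the chain rule in the appendix.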
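6.2.  Differentiation  by  Backward  Propagation

Let us first see how the first order derivatives $\partial x_i^L/\partial x_j^l$ can be calculated recursively for $l = L-1, \ldots, 0$. By the chain rule of differentiation,

$$\frac{\partial x_i^L}{\partial x_j^l} = \sum_s \frac{\partial x_i^L}{\partial x_s^{l+1}}\cdot\frac{\partial x_s^{l+1}}{\partial y_s^{l+1}}\cdot\frac{\partial y_s^{l+1}}{\partial x_j^l} = \sum_s \frac{\partial x_i^L}{\partial x_s^{l+1}}\, f'(y_s^{l+1})\, w_{sj}^{l+1}. \tag{6.5}$$

Thus, starting with $\partial x_i^L/\partial x_j^{L-1} = f'(y_i^L)\, w_{ij}^L$, we are able to obtain $\partial x_i^L/\partial x_j^l$ recursively.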
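The backward recursion can be sketched in the same spirit. This is an illustration under stated assumptions rather than the authors' implementation: tanh activations, no bias terms, and hypothetical names `forward_pass`, `backward_jacobian` and weight values. The derivatives of the outputs w.r.t. the inputs are checked against central finite differences.

```python
import math


def forward_pass(weights, x0):
    """Return the weighted sums y^l of every layer and the network output."""
    x, sums = list(x0), []
    for w in weights:
        y = [sum(w_ik * x_k for w_ik, x_k in zip(row, x)) for row in w]
        sums.append(y)
        x = [math.tanh(v) for v in y]
    return sums, x


def backward_jacobian(weights, x0):
    """Propagate d x^L / d x^l backward from l = L-1 down to l = 0."""
    sums, out = forward_pass(weights, x0)
    n_out = len(out)
    # grad[i][s] = d x_i^L / d x_s^{l+1}; the identity at the output layer
    grad = [[1.0 if i == s else 0.0 for s in range(n_out)] for i in range(n_out)]
    for w, y in zip(reversed(weights), reversed(sums)):
        fprime = [1.0 - math.tanh(v) ** 2 for v in y]  # f'(y_s^{l+1}) for tanh
        n_prev = len(w[0])
        grad = [[sum(grad[i][s] * fprime[s] * w[s][j] for s in range(len(w)))
                 for j in range(n_prev)] for i in range(n_out)]
    return out, grad


weights = [
    [[0.5, -0.3], [0.8, 0.2], [-0.6, 0.9]],  # layer 1: 2 inputs -> 3 neurons
    [[0.4, -0.7, 0.1]],                      # layer 2: 3 neurons -> 1 output
]
x0 = [0.3, -0.2]
out, grad = backward_jacobian(weights, x0)

eps = 1e-6
numeric = []
for j in range(len(x0)):
    hi = list(x0)
    lo = list(x0)
    hi[j] += eps
    lo[j] -= eps
    numeric.append((forward_pass(weights, hi)[1][0]
                    - forward_pass(weights, lo)[1][0]) / (2.0 * eps))
```

The intermediate values of `grad` are the derivatives of the outputs w.r.t. the neuron outputs of each layer, which are the quantities used for interpreting and pruning hidden neurons.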

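It is also possible to utilize the first order derivatives obtained in (6.5) to derive higher order derivatives recursively. For the second order derivatives,

$$\frac{\partial^2 x_i^L}{\partial x_j^l\,\partial x_k^l} = \frac{\partial}{\partial x_k^l}\sum_s \frac{\partial x_i^L}{\partial x_s^{l+1}}\, f'(y_s^{l+1})\, w_{sj}^{l+1}$$

$$= \sum_{s,t}\frac{\partial^2 x_i^L}{\partial x_s^{l+1}\,\partial x_t^{l+1}}\, f'(y_s^{l+1})\, w_{sj}^{l+1}\, f'(y_t^{l+1})\, w_{tk}^{l+1} + \sum_s \frac{\partial x_i^L}{\partial x_s^{l+1}}\, f''(y_s^{l+1})\, w_{sj}^{l+1}\, w_{sk}^{l+1}, \tag{6.6}$$

with the initial condition $\partial^2 x_i^L/(\partial x_j^{L-1}\partial x_k^{L-1}) = f''(y_i^L)\cdot w_{ij}^L\cdot w_{ik}^L$.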

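Denote the set of all $n$-partitions (defined in the previous section) of $\{j_1, \ldots, j_N\}$ as $P_n^N$. By the chain rule (6.21) with $g = x_i^L$ and $z_j = x_j^{l+1}$, $j = 1, \ldots, n_{l+1}$, the backward propagation formula for the $N$-th order derivatives is

$$\Big(\prod_{n=1}^{N}\frac{\partial}{\partial x_{j_n}^l}\Big)\, x_i^L = \sum_{n=1}^{N}\ \sum_{s_1,\ldots,s_n}\Big[\Big(\prod_{k=1}^{n}\frac{\partial}{\partial x_{s_k}^{l+1}}\Big)\, x_i^L\Big] \sum_{p_n^N \in P_n^N(j_1,\ldots,j_N)}\ \prod_{k=1}^{n}\Big[f^{(m_{n,k}^N)}(y_{s_k}^{l+1})\prod_{u \in p_{n,k}^N} w_{s_k u}^{l+1}\Big], \tag{6.7}$$

where $m_{n,k}^N$ is the number of elements in $p_{n,k}^N$.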

6.3.  Differentiating  Partial  Derivatives  w.r.t.  Weights 

In this section, we give the formulas for calculating $\partial P/\partial w_{pq}^r$, where $P$ is a partial derivative of an output variable w.r.t. input variables for a multilayer feedforward network, and $w_{pq}^r$ is the synaptic weight connecting the $q$-th neuron in the $(r-1)$-th layer to the $p$-th neuron in the $r$-th layer. Also, we denote the weighted sum in and output of the $i$-th neuron in the $l$-th layer as $y_i^l$ and $x_i^l$, respectively. And the $i$-th input variable is denoted $x_i^0$.

We use three different schemes to calculate $\partial P/\partial w_{pq}^r$. Forward and backward propagation methods can be derived directly from the forward and backward propagation formulas for the partial derivatives. Furthermore, we propose a direct method which, in some cases, has a more compact form.

The forward propagation formula for calculating $\partial P/\partial w_{pq}^r$ is obtained by differentiating Eq. (6.4),


E  E(n 

n=l 


ax'+' 


E^s^  E  fiKnlj)^.'! 
+E/'”'(vi)  E 

E/'’*‘>(».')|^  E  nKn^teJi 
+E E  E 

^=1  *epir,.  ^ 
where $P_n^N(j_1, \ldots, j_N)$ and $p_n^N$ are defined in Section 6.1. We know that


!r[(n4«i  =  4  E 

P9  n=l  P9  k  n=l 

n—  1  k 


dx°  ’ 

Jn 

(6.9) 

,  dxL  ^ 

(6.10) 

The  backward  propagation  formula  can  be  obtained  by  differentiating  Eq.  (6.7), 


d  d 


d  r,TT  d 


rKna^)^fl  =  E  E  £r[(n^K‘l  E 


P9  n=l  ^  Jn 


^  ^  dw^  ^ 

n  N  n 


[n/'"“W)(n>»;:.')i+E  E(n34)-; 


Z-^  '^11 

n=l  3i  kzzl 


E  £rin 

Pire^n"^0i,-,JN)  P’  «=1 


E  E  4t<n  §4)^' 1  E 

n=l  Si,-,5„  PI  k=\  pi!ePj:’(ju-,]N) 

n  N  n 


[n/‘”“W)(n  »;:.■)] +E  e 

^=1  n=:l  51  /:=! 


^2-^^  f(m{^,J(yl  +  l)  '  Qyj 

p^ep’:'Uu-,jN)  *'=1  ^  > 


„epN^  Sv^  t=l 
Observing that $x_{j_n}^0$ is independent of $w_{pq}^r$, we are able to change their order of differentiation. This enables us to obtain $\partial P/\partial w_{pq}^r$ directly without propagation:


d  /TT  ^  \  L  /'TT  ^  M  \ 


P9  n=l  Jn 


n=l  P® 

n=l  P 

y  (  TT  A)M 


(6.12) 


where  is  an  ordered  partition  of  {ji,  •  •  •  ,jN}  into  3  groups,  c^j,  ^3,3^ 

with  0  or  more  elements  and  •  •  • ,  jjv)  is  the  set  of  all  .  It  is  the  sum  of  the 

products  of  three  factors.  The  third  factor  can  be  obtained  using  Eq.  (6.4).  We 
rewrite  the  first  factor  using  the  chain  rule  (6.21)  with  g  =  5xf/5xp  and  =  x^, 
j  =  !,•••  ,nr, 

-  t  E  E 

n=l  Jn  P  M=:lki,-,kM  m  =  l  ’’  pj^ePj(J(jj,-JN) 


IlK  n  a?)"Ll. 


(6.13) 


where  the  first  partial  derivative  can  be  obtained  by  Eq.  (6.7)  and  the  second  can 
be  obtained  by  Eq.  (6.4).  The  second  factor  in  (6.12)  can  be  rewritten  using  the 
chain  rule  (6.21)  with  g  =  /'{y^)  and  ^  j  =  E  •  •  ■  1  ^r-i, 

TV  N  M 

(n^)/'«)  =  E  E  /'""■’(!/;)( ri"-;*-)  E 

n=l  in  M=lki,--,kM  ni=l  PM€PM(j\  ,-,Jn) 


IlK  n  s;:5)CI. 


(6.14) 


m=l  sgp.V 


where  the  partial  derivative  can  be  obtained  by  Eq.  (6.4).

6.4.  Estimating  the  covariance  of  the  weights

In order to test the hypothesis that the weight on a certain connection is zero and the connection can be severed, we need the probability distribution of the weight. Furthermore, these probability distributions of the weights determine those of the derivatives of all orders of the multilayer feedforward network, which are needed to decide which inputs and neurons can be pruned. The statistical model that we use is the following:

$$y_t = f(x_t, w) + \epsilon_t, \qquad t = 1, \ldots, T, \tag{6.15}$$

where $(x_t, y_t)$ is the $t$-th training pair, $w$ is the vector of all the weights, $\{\epsilon_t\}$ are iid normal with mean 0 and covariance $V$, and $f(x_t, w)$ is the output of the network with weights $w$ for the input $x_t$.

Rewrite (6.15) as a vector equation

$$y = F(x, w) + \epsilon, \tag{6.16}$$

for $y = [y_1', \ldots, y_T']'$, $F(x, w) = [f(x_1, w)', \ldots, f(x_T, w)']'$ and $\epsilon = [\epsilon_1', \ldots, \epsilon_T']'$. Let $w^*$ denote the weight vector obtained from training. We linearize $F$ about $w^*$ and approximate (6.16) by

$$y = F(x, w^*) + \nabla_w F(x, w^*)(w - w^*) + \epsilon.$$

Note that $\nabla_w F(x, w^*)$ can be obtained by the standard backpropagation. The estimation of $w$ for a different realization of $\epsilon$ and thus $y$ is the linear regression for the linear model

$$y - F(x, w^*) = \nabla_w F(x, w^*)(w - w^*) + \epsilon,$$

which yields the estimate

$$\hat w = w^* + [(\nabla_w F(x, w^*))'(\nabla_w F(x, w^*))]^{-1}(\nabla_w F(x, w^*))'(y - F(x, w^*)).$$

Hence the covariance of $\hat w$ is

$$E[(\hat w - w)(\hat w - w)'] = [(\nabla_w F(x, w^*))'(\nabla_w F(x, w^*))]^{-1}(\nabla_w F(x, w^*))'\, V\, (\nabla_w F(x, w^*))[(\nabla_w F(x, w^*))'(\nabla_w F(x, w^*))]^{-1}. \tag{6.17}$$

Since $w^*$ is one of those $\hat w$, we conclude that $w^*$ in repeated sampling has covariance equal to the right side of (6.17).

It goes without saying that the above formula for the covariance of $w^*$ is valid only if the errors $\epsilon$ are sufficiently small that the foregoing linearization is justified.
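The covariance formula (6.17) can be evaluated with a few matrix products. The sketch below is illustrative only: it assumes a model with two weights and three scalar training outputs, a hypothetical Jacobian `jac` standing in for $\nabla_w F(x, w^*)$, and noise covariance $\sigma^2 I$ for the stacked errors. In that special case the sandwich form reduces to $\sigma^2[(\nabla_w F)'(\nabla_w F)]^{-1}$, which the example verifies.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def weight_covariance(jac, noise_cov):
    """Evaluate (J'J)^{-1} J' V J (J'J)^{-1} for a Jacobian with two columns."""
    jt = transpose(jac)
    left = matmul(inverse_2x2(matmul(jt, jac)), jt)  # (J'J)^{-1} J'
    return matmul(matmul(left, noise_cov), transpose(left))


sigma = 0.005
jac = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # hypothetical 3 x 2 Jacobian
noise_cov = [[sigma ** 2 if i == j else 0.0 for j in range(3)] for i in range(3)]
cov = weight_covariance(jac, noise_cov)
std_devs = [cov[i][i] ** 0.5 for i in range(2)]  # standard deviations of the weights
```

The square roots of the diagonal entries are the standard deviations used later to form the z-statistics of the individual weights.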



F30602-91-C-0033 


Maryland  Technology  Corporation 


6.5.  Estimating  the  covariance  of  the  partial  derivatives  of  a  multilayer 
feedforward  network 

The partial derivatives of a multilayer feedforward network are functions of the weight vector $w$ of the network. The variability of the estimate $\hat w$ of $w$ induces that of the partial derivatives of the network with weights set equal to $\hat w$.

Denoting by $D(w)$ the partial derivative under consideration at a given input vector of the network, regarded as a function of the weights $w$, we linearize $D$ about $w^*$ as follows:

$$D(w) \approx D(w^*) + \nabla_w D(w^*)(w - w^*). \tag{6.18}$$

In particular, at an estimate $\hat w$ of $w$,

$$D(\hat w) \approx D(w^*) + \nabla_w D(w^*)(\hat w - w^*). \tag{6.19}$$

Subtracting (6.18) from (6.19) yields

$$D(\hat w) - D(w) \approx \nabla_w D(w^*)(\hat w - w),$$

$$E[(D(\hat w) - D(w))^2] \approx \nabla_w D(w^*)\, E[(\hat w - w)(\hat w - w)']\, (\nabla_w D(w^*))'. \tag{6.20}$$

We stress that the symbol $D$ denotes a certain partial derivative at a given input vector throughout the foregoing calculation.
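Formula (6.20) is a quadratic form in the gradient of the derivative w.r.t. the weights. A minimal sketch follows, with a hypothetical two-weight gradient and weight covariance chosen only for illustration; the function name `derivative_variance` is not from the report.

```python
def derivative_variance(grad, weight_cov):
    """Return grad * weight_cov * grad' for a row vector grad, as in (6.20)."""
    n = len(grad)
    return sum(grad[i] * weight_cov[i][j] * grad[j]
               for i in range(n) for j in range(n))


grad = [2.0, -1.0]                          # hypothetical gradient of D w.r.t. w at w*
weight_cov = [[0.04, 0.01], [0.01, 0.09]]   # hypothetical covariance of the weights
variance = derivative_variance(grad, weight_cov)
std_dev = variance ** 0.5
```

The resulting standard deviation is what the derivative is divided by when its z-statistic is formed.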

6.6.  Pruning  weights  and  input  variables  by  testing  hypotheses 

Let us briefly review the method of testing the null hypothesis $H_0$ that the population mean $\mu$ is zero versus the alternative hypothesis $H_1$ that it is not. If a statistic $x$, whose mean and variance in repeated sampling are $\mu$ and $s^2$ respectively, is obtained, the $z$-statistic $z = x/s$, which under $H_0$ is the number of standard deviations that $x$ lies away from $\mu = 0$, is usually used to test the hypothesis $H_0$. The observed significance level is the chance of getting a test statistic as extreme as or more extreme than the observed one. The smaller this chance is, the stronger the evidence against the null.

For instance, if $z = 2.00$, then the observed significance level is 4.55%. This level is usually accepted as sufficiently small in practice and a $z$-statistic whose magnitude is greater than or equal to 2.00 is viewed as sufficient evidence to reject the null hypothesis.

The statistic that we will use to test the hypothesis that a synaptic weight $w_{ij}^l$ is zero and thus the connection should be pruned is the component of $w^*$ that is an estimate of the synaptic weight. Given the variance of the component of $w^*$ as computed in Section 6.4, we can evaluate the $z$-statistic for it. If the magnitude of the $z$-statistic is less than 2.00, we suspect that the corresponding component of $w$ is actually zero and the corresponding connection should be pruned.

Similarly, the statistic for testing the hypothesis that a derivative $D(w)$ at an input vector is zero is $D(w^*)$ evaluated at the same input vector. If the magnitude of the $z$-statistic $D(w^*)/(\text{the standard deviation of } D(w^*))$ is less than 2.00 at the input vector, we suspect that $D(w)$ is zero there.
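The observed significance level quoted for a z-statistic of 2.00 can be reproduced from the standard normal distribution. The sketch below assumes a two-sided test; the function names `observed_significance` and `is_prunable` are hypothetical.

```python
import math


def observed_significance(z):
    """Two-sided tail probability of a standard normal variable beyond |z|."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def is_prunable(estimate, std_dev, threshold=2.0):
    """Flag an estimate whose z-statistic is smaller in magnitude than the threshold."""
    return abs(estimate / std_dev) < threshold


level = observed_significance(2.0)  # about 0.0455, i.e. 4.55%
```

Comparing the magnitude of the z-statistic with the threshold, rather than its signed value, keeps large negative estimates from being flagged as prunable.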

6.7.  A  Numerical  Example 

We will now illustrate by an example how we can prune weights and input variables (or neurons). We consider a (2:2:1) feedforward network with an identity activation function for the output neuron. Denoting the 2 input variables by $x$ and $y$, the network is trained to approximate the function $g(x,y) = (1 + \sin 2\pi x)/2$ over the region $[0,1) \times [0,1)$. The training data is obtained by first selecting 400 points at random from the region and evaluating $g(x,y)$ at each point. Then 400 statistically independent normal random numbers with mean zero and standard deviation 0.005 are generated. Each is added to one of the 400 values of $g(x,y)$. The 400 pairs of $((x,y), g(x,y))$, with the noise added to $g(x,y)$, are the training data.
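The construction of the training data can be sketched as follows. The uniform sampling of the points, the seed and the names `target` and `make_training_data` are assumptions for illustration; the report only states that the points are selected at random.

```python
import math
import random


def target(x, y):
    """g(x, y) = (1 + sin(2 pi x)) / 2; note that it does not depend on y."""
    return (1.0 + math.sin(2.0 * math.pi * x)) / 2.0


def make_training_data(n_points=400, noise_std=0.005, seed=0):
    """Return a list of ((x, y), noisy g(x, y)) pairs on [0, 1) x [0, 1)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        data.append(((x, y), target(x, y) + rng.gauss(0.0, noise_std)))
    return data


data = make_training_data()
# Mean square of the added noise; its expected value is 0.005^2 = 2.5e-5.
noise_mse = sum((t - target(x, y)) ** 2 for (x, y), t in data) / len(data)
```

The mean square of the added noise sets the level that a well-trained network can be expected to reach on these data.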

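Training by the conjugate gradient method brought the MSE down to $2.66 \times 10^{-5}$, consistent with the standard deviation (0.005) of the additive error in the training output data. The weights $w^*$ are

$$[w_{ij}^{1*}] = \begin{bmatrix} 0.5946 & -1.1836 & 0.0001 \\ 1.2442 & -2.4851 & 0.0002 \end{bmatrix},$$

$$[w_{j}^{2*}] = \begin{bmatrix} 0.5152 & -8.5174 & 5.3379 \end{bmatrix}.$$

The standard deviations of $w^*$, which were calculated by the formula (6.17), are

$$[s_{ij}^{1}] = \begin{bmatrix} 0.1363 & 0.2738 & 0.0005 \\ 0.0882 & 0.1764 & 0.0010 \end{bmatrix},$$

$$[s_{j}^{2}] = \begin{bmatrix} 0.0146 & 1.1031 & 1.8380 \end{bmatrix}.$$

The $z$-statistics for $w^*$ are then

$$[w_{ij}^{1*}/s_{ij}^{1}] = \begin{bmatrix} \text{[illegible]} & -4.3224 & 0.2017 \\ 14.1022 & -14.0878 & 0.2038 \end{bmatrix},$$

$$[w_{j}^{2*}/s_{j}^{2}] = \begin{bmatrix} 35.2552 & -7.7212 & 2.9042 \end{bmatrix}.$$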

We notice that the above $z$-statistics indicate that $w_{13}$ and $w_{23}$ are possibly prunable. In fact they are the weights leading out of the input variable $y$, which actually has no effect on the function $g(x,y) = (1 + \sin 2\pi x)/2$ that the network is intended to approximate.
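The pruning decision for the hidden-layer weights can be retraced from the reported weights and standard deviations. The sketch below uses the rounded values printed in this section, so the quotients agree with the reported z-statistics only to about two decimal places; the variable names are hypothetical.

```python
# Hidden-layer weights and their standard deviations as printed in this section.
hidden_w = [[0.5946, -1.1836, 0.0001],
            [1.2442, -2.4851, 0.0002]]
hidden_s = [[0.1363, 0.2738, 0.0005],
            [0.0882, 0.1764, 0.0010]]

# z-statistic of each weight: estimate divided by its standard deviation.
z = [[w / s for w, s in zip(w_row, s_row)]
     for w_row, s_row in zip(hidden_w, hidden_s)]

# 1-based (i, j) indices of weights whose |z| falls below the 2.00 threshold.
prunable = [(i + 1, j + 1)
            for i, row in enumerate(z)
            for j, value in enumerate(row)
            if abs(value) < 2.0]
```

Both flagged weights sit in the third column, that is, on the connections leaving the input variable $y$.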
Figure 6.1: $z$-statistics for $\partial O/\partial x$

The $z$-statistics for the first order derivatives $\partial O/\partial x$ and $\partial O/\partial y$ at each of the 400 input vectors were also calculated, where $O$ denotes the output of the feedforward network. The histograms for the $z$-statistics for $\partial O/\partial x$ and $\partial O/\partial y$ are given in Figs. 6.1 and 6.2 respectively. The $z$-statistics for $\partial O/\partial y$ are distributed between $-0.2$ and $0.15$, suggesting that the input variable $y$ is perhaps prunable.

6.8.  Conclusions 

If its activation functions are analytic, a multilayer feedforward neural network is also analytic, and its Taylor series expansions converge to it. The backward and forward propagation methods of calculating its derivatives of all orders make such series expansions efficient to compute. These series expansions provide ample intuitive interpretation of the functional relationships hidden in the training data.

The covariances of the weights and the derivatives allow the determination of the confidence regions and the testing of the hypotheses that a weight, a neuron or an input is prunable. They reflect the sensitivity of the weights and derivatives w.r.t. the noises in the training data and remove the subjectivity in network pruning.

We are restricted to multilayer feedforward neural networks in this paper. Nevertheless, the results can be extended to recurrent neural networks of all types. The extensions will soon be reported in a forthcoming paper of the authors.

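6.9.  Appendix:  A  Chain  Rule  for  Differentiating  Multilayer  Feedforward  Networks

Suppose $\{z_1, \ldots, z_{n_r}\}$ are either a subset of $\{x_1^r, \ldots, x_{n_r}^r\}$ or a subset of $\{y_1^r, \ldots, y_{n_r}^r\}$ and $g$ is an $N$-th order differentiable function of $\{z_1, \ldots, z_{n_r}\}$. Given $r > l$, the $N$-th order derivative $\big(\prod_{n=1}^{N}\partial/\partial x_{j_n}^l\big) g$ can be expressed in terms of the derivatives of $g$ with respect to $z_i$ and the derivatives of $z_i$ (for $i$ from 1 to $n_r$) with respect to $x_{j_n}^l$ as follows:

$$\Big(\prod_{n=1}^{N}\frac{\partial}{\partial x_{j_n}^l}\Big)\, g = \sum_{n=1}^{N}\ \sum_{s_1,\ldots,s_n}\Big(\prod_{k=1}^{n}\frac{\partial}{\partial z_{s_k}}\Big)\, g \sum_{p_n^N \in P_n^N(j_1,\ldots,j_N)}\ \prod_{k=1}^{n}\Big(\prod_{u \in p_{n,k}^N}\frac{\partial}{\partial x_u^l}\Big)\, z_{s_k}. \tag{6.21}$$

We refer to it as the chain rule for multilayer feedforward networks. Eq. (6.21) can be proved by mathematical induction. We first note that: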

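$$P_n^{N+1}(j_1,\ldots,j_{N+1}) = \{p_n^{N,m} \mid p_n^N \in P_n^N(j_1,\ldots,j_N),\ m = 1,\ldots,n\} \cup \{p_{n-1}^N \cup \{\{j_{N+1}\}\} \mid p_{n-1}^N \in P_{n-1}^N(j_1,\ldots,j_N)\}, \tag{6.22}$$

where

$$p_n^{N,m} \triangleq \{p_{n,1}^{N,m}, \ldots, p_{n,n}^{N,m}\}, \qquad p_{n,k}^{N,m} \triangleq \begin{cases} p_{n,k}^N \cup \{j_{N+1}\}, & k = m, \\ p_{n,k}^N, & \text{otherwise}. \end{cases}$$

Suppose (6.21) holds, then

$$\frac{\partial}{\partial x_{j_{N+1}}^l}\Big(\prod_{n=1}^{N}\frac{\partial}{\partial x_{j_n}^l}\Big)\, g = \sum_{n=1}^{N}\ \sum_{s_1,\ldots,s_n}\Big[\frac{\partial}{\partial x_{j_{N+1}}^l}\Big(\prod_{k=1}^{n}\frac{\partial}{\partial z_{s_k}}\Big)\, g\Big] \sum_{p_n^N \in P_n^N(j_1,\ldots,j_N)}\ \prod_{k=1}^{n}\Big(\prod_{u \in p_{n,k}^N}\frac{\partial}{\partial x_u^l}\Big)\, z_{s_k}$$

$$\qquad + \sum_{n=1}^{N}\ \sum_{s_1,\ldots,s_n}\Big(\prod_{k=1}^{n}\frac{\partial}{\partial z_{s_k}}\Big)\, g \sum_{p_n^N \in P_n^N(j_1,\ldots,j_N)} \frac{\partial}{\partial x_{j_{N+1}}^l}\Big[\prod_{k=1}^{n}\Big(\prod_{u \in p_{n,k}^N}\frac{\partial}{\partial x_u^l}\Big)\, z_{s_k}\Big]$$

$$= \sum_{n=1}^{N}\ \sum_{s_1,\ldots,s_{n+1}}\Big(\prod_{k=1}^{n+1}\frac{\partial}{\partial z_{s_k}}\Big)\, g \sum_{p_n^N \in P_n^N(j_1,\ldots,j_N)}\Big[\prod_{k=1}^{n}\Big(\prod_{u \in p_{n,k}^N}\frac{\partial}{\partial x_u^l}\Big)\, z_{s_k}\Big]\frac{\partial z_{s_{n+1}}}{\partial x_{j_{N+1}}^l}$$

$$\qquad + \sum_{n=1}^{N}\ \sum_{s_1,\ldots,s_n}\Big(\prod_{k=1}^{n}\frac{\partial}{\partial z_{s_k}}\Big)\, g \sum_{p_n^N \in P_n^N(j_1,\ldots,j_N)}\ \sum_{m=1}^{n}\ \prod_{k=1}^{n}\Big(\prod_{u \in p_{n,k}^{N,m}}\frac{\partial}{\partial x_u^l}\Big)\, z_{s_k}$$

$$= \sum_{n=1}^{N+1}\ \sum_{s_1,\ldots,s_n}\Big(\prod_{k=1}^{n}\frac{\partial}{\partial z_{s_k}}\Big)\, g \sum_{p_n^{N+1} \in P_n^{N+1}(j_1,\ldots,j_{N+1})}\ \prod_{k=1}^{n}\Big(\prod_{u \in p_{n,k}^{N+1}}\frac{\partial}{\partial x_u^l}\Big)\, z_{s_k},$$

which completes the proof.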


7.  Conclusions 

The foundation of a very powerful new approach to optimal filtering has been laid down. As opposed to the conventional theories, this approach is synthetic in nature. It synthesizes realizations of the signal and measurement processes into a filter. If a mathematical model of the signal and measurement processes is available, realizations of them can be simulated on a computer. If not, actual measurements of these processes can be used, eliminating the need to construct a mathematical model of these processes.

Mathematical  assumptions  such  as  Markov  property,  linear  dynamics,  Gaus¬ 
sian  distribution,  and  additive  noise  are  not  required  in  this  new  approach.  It 
does  not  derive  equations  or  formulas,  but  synthesizes  a  filter  through  training 
and  selecting  recurrent  neural  networks  (RNNs).  The  selected  RNN  after  train¬ 
ing,  which  is  called  a  neural  filter,  can  be  made  as  close  to  an  optimal  filter  in 
performance  as  desired. 

The massively parallel structures of the RNNs allow real-time implementation of the neural filters in many computation-intensive applications such as in image and video processing. The availability of analog and digital neural computing VLSI chips makes this real-time implementation simple and affordable.

There are still three problem areas which remain to be explored further. The first is adaptive neural filtering. The initial results reported in Chapter 5 are exciting, indicating that a lot more can be done. The neural network approach is a very natural one to treat adaptive filtering, due to the learning capability of neural networks.

The  second  problem  area  is  the  selection  or  determination  of  an  optimal  RNN 
architecture  for  a  given  application.  Growing  and  pruning  the  neurons  and  con¬ 
nections  are  two  important  issues  to  be  studied. 

The last, but not the least, problem area is the training algorithms of the RNNs. Existing algorithms are slow and need a large number of repeated trainings with different initial weights/parameters for the RNN to avoid a poor local minimum of the training criterion. The more reliable algorithms also require a very large memory (e.g. 128MB or more) on the computer. Therefore, better algorithms for training RNNs are highly desirable.